RANDOM MATRICES: 
THE DISTRIBUTION OF THE SMALLEST SINGULAR VALUES 



TERENCE TAO AND VAN VU 



Abstract. Let $\xi$ be a real-valued random variable of mean zero and variance 1.
Let $M_n(\xi)$ denote the $n \times n$ random matrix whose entries are iid copies of $\xi$ and
$\sigma_n(M_n(\xi))$ denote the least singular value of $M_n(\xi)$. The quantity $\sigma_n(M_n(\xi))^2$ is
thus the least eigenvalue of the Wishart matrix $M_n M_n^*$.

We show that (under a finite moment assumption) the probability distribution of
$n\sigma_n(M_n(\xi))^2$ is universal in the sense that it does not depend on the distribution of
$\xi$. In particular, it converges to the same limiting distribution as in the special case
when $\xi$ is real gaussian. (The limiting distribution was computed explicitly in this
case by Edelman.)

We also prove a similar result for complex-valued random variables of mean zero,
with real and imaginary parts having variance 1/2 and covariance zero. Similar results
are also obtained for the joint distribution of the bottom $k$ singular values of $M_n(\xi)$
for any fixed $k$ (or even for $k$ growing as a small power of $n$) and for rectangular
matrices.

Our approach is motivated by the general idea of "property testing" from
combinatorics and theoretical computer science. This seems to be a new approach in
the study of spectra of random matrices and combines tools from various areas of 
mathematics. 



1. Introduction 

Let $\xi$ be a real or complex-valued random variable and $M_n(\xi)$ denote the random
$n \times n$ matrix whose entries are i.i.d. copies of $\xi$. In this paper, we always impose one
of the following two normalizations on $\xi$:



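(R-normalization) $\xi$ is real-valued with $\mathbf{E}\xi = 0$ and $\mathbf{E}\xi^2 = 1$.

(C-normalization) $\xi$ is complex-valued with $\mathbf{E}\xi = 0$, $\mathbf{E}\,\mathrm{Re}(\xi)^2 = \mathbf{E}\,\mathrm{Im}(\xi)^2 = \frac{1}{2}$,
and $\mathbf{E}\,\mathrm{Re}(\xi)\mathrm{Im}(\xi) = 0$.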



Note that in both cases $\xi$ has mean zero and variance one. A model example of an
R-normalized random variable is the real gaussian $g_{\mathbf{R}} \equiv N(0,1)$, while a model example of
a C-normalized random variable is the complex gaussian $g_{\mathbf{C}}$ whose real and imaginary
parts are iid copies of $\frac{1}{\sqrt{2}} g_{\mathbf{R}}$. Another R-normalized random variable of interest is
Bernoulli, in which $\xi$ equals $+1$ or $-1$ with an equal probability 1/2 of each.

T. Tao is supported by a grant from the MacArthur Foundation, and by NSF grant DMS-0649473. 
V. Vu is supported by NSF Career Grant 0635606. 




A basic problem in random matrix theory is to understand the distribution of singular
values in the asymptotic limit $n \to \infty$. Given an $m \times n$ matrix $M$, let

$$\sigma_1(M) \ge \ldots \ge \sigma_{\min(n,m)}(M) \ge 0$$

denote the non-trivial singular values of $M$. Our paper will be focused on the "hard
edge" of this spectrum, and in particular on the least singular value $\sigma_n(M)$, in the case
of square matrices $m = n$, but let us begin with a brief review of some known results
for the rest of the spectrum.

There is a general belief that (under reasonable hypotheses) the limiting distributions 
concerning the spectrum of a large random matrix should be "universal" , in the sense 
that they should not depend too strongly on the distributions of the entries of the 
matrix. In particular, one usually expects that the asymptotic statistical properties 
that are known for matrices with (independent) gaussian entries should also hold for 
matrices with more general entries (such as Bernoulli). A well-known conjecture (now 
a theorem) of this type is the Circular Law conjecture (see 0, Chapter 10] and [47]). 
Another example is that of Dixon's conjectures (see [2l)j Conjectures 1.2.1 and 1.2.2]). 

Universality has been proved for several statistics concerning the random matrix model
$M_n(\xi)$. As is well known, the bulk distribution of the singular values of $M_n(\xi)$ is
governed by the Marchenko-Pastur law. It has been shown that for any $t > 0$ and any
R- or C-normalized $\xi$,

$$\frac{1}{n}\left|\left\{1 \le i \le n : \frac{1}{n}\sigma_i(M_n(\xi))^2 \le t\right\}\right| \to \frac{1}{2\pi}\int_0^{\min(t,4)} \sqrt{\frac{4}{x} - 1}\, dx$$

as n — > oo, both in the sense of probability and in the almost sure sense. For more 
details, we refer to [2H ED EH E]- (In literature, very frequently one views cjj(M n (^)) 2 
as the eigenvalues of the sample covariance matrix $M_n(\xi)M_n(\xi)^*$ and so it is more
traditional to write down the limiting distributions in terms of $\sigma^2$.)

The next objects to consider are the extremal singular values. The distribution of
the largest singular value $\sigma_1$ (and more generally, the joint distribution of the top $k$
singular values) was computed for the gaussian case by Johansson [18] and Johnstone
[19]. This distribution is governed by the Tracy-Widom law (and more generally, the
Airy kernel). In particular, one has

$$\frac{\sigma_1^2/n - 4}{2^{4/3} n^{-2/3}} \to TW$$

where $TW$ denotes the Tracy-Widom distribution. More recently, Soshnikov [39]
showed that the same result holds for all random matrices with normalized subgaussian
entries.

Now we turn to the "hard edge" of the spectrum, and specifically to the least singular
value $\sigma_n(M_n(\xi))$. The problem of estimating the least singular value of a random
matrix has a long history. It first surfaced in the work of von Neumann and Goldstine
concerning numerical inversion of large matrices [30]. Later, Smale [38] made a specific
conjecture about the magnitude of $\sigma_n$. Motivated by a question of Smale, Edelman
computed the distribution of $\sigma_n(M_n(\xi))$ for the real and complex gaussian cases $\xi = g_{\mathbf{R}}, g_{\mathbf{C}}$
®: 

Theorem 1.1 (Limiting distributions for gaussian models). For any fixed t > 0, we 
have 

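$$\mathbf{P}(n\sigma_n(M_n(g_{\mathbf{R}}))^2 \le t) = \int_0^t \frac{1+\sqrt{x}}{2\sqrt{x}} e^{-(x/2+\sqrt{x})}\, dx + o(1) \qquad (1)$$

as well as the exact (!) formula

$$\mathbf{P}(n\sigma_n(M_n(g_{\mathbf{C}}))^2 \le t) = \int_0^t e^{-x}\, dx.$$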

Both integrals can be computed explicitly. By change of variables, one can show 

$$\int_0^t \frac{1+\sqrt{x}}{2\sqrt{x}} e^{-(x/2+\sqrt{x})}\, dx = 1 - e^{-t/2-\sqrt{t}}. \qquad (2)$$

Furthermore, it is clear that

$$\int_0^t e^{-x}\, dx = 1 - e^{-t}. \qquad (3)$$
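As a sanity check on the two closed forms (2) and (3), the following minimal sketch compares them with a numerical quadrature. It is an illustration only: the function names and the midpoint rule are choices made here, not part of the paper. The real case uses the substitution $x = u^2$, which turns the singular integrand into the smooth function $(1+u)e^{-(u^2/2+u)}$ on $[0, \sqrt{t}]$.

```python
import math


def real_limit_cdf(t):
    """Closed form on the right-hand side of (2)."""
    return 1.0 - math.exp(-t / 2.0 - math.sqrt(t))


def complex_limit_cdf(t):
    """Closed form on the right-hand side of (3)."""
    return 1.0 - math.exp(-t)


def midpoint_rule(f, a, b, steps=20000):
    """Plain midpoint quadrature; adequate for smooth integrands."""
    h = (b - a) / steps
    return h * sum(f(a + (k + 0.5) * h) for k in range(steps))


def real_limit_integral(t):
    """Left-hand side of (2) after substituting x = u^2, dx = 2u du.

    (1 + sqrt(x)) / (2 sqrt(x)) * exp(-(x/2 + sqrt(x))) dx becomes
    (1 + u) * exp(-(u^2/2 + u)) du on [0, sqrt(t)].
    """
    return midpoint_rule(
        lambda u: (1.0 + u) * math.exp(-(u * u / 2.0 + u)), 0.0, math.sqrt(t)
    )


def complex_limit_integral(t):
    """Left-hand side of (3), integrated directly."""
    return midpoint_rule(lambda x: math.exp(-x), 0.0, t)
```

Calling `real_limit_integral(t)` and `real_limit_cdf(t)` for a few values of `t` should give numbers that agree to many digits, which is just the change of variables mentioned above carried out numerically.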



In fact, one can compute the joint distribution of the bottom $k$ singular values of
$M_n(g_{\mathbf{R}})$ or $M_n(g_{\mathbf{C}})$ for any constant $k$, as was done by Forrester [TTj. The formula,
which involves the Bessel kernel, is more complicated and is deferred to Section 6. In
[315], a different approach was proposed by Rider and Ramirez, which leads to a description
that does not involve Bessel kernels directly. Ben Arous and Peche [H] generalized
Forrester's result to matrices whose entries are gaussian summable.

The error term $o(1)$ in (1) is not explicitly stated in [U] (it relies on an asymptotic
for the Tricomi function), but our Theorem 1.3 below will imply that it is of the form
$O(n^{-c})$ for some absolute constant $c > 0$.

The proofs of the above results relied on special algebraic properties of the gaussian
models $g_{\mathbf{R}}$, $g_{\mathbf{C}}$ (and in particular on various exact identities enjoyed by such models).
For instance, Edelman's proof used the exact joint distribution of the eigenvalues of
$M_n M_n^*$, which is available in the gaussian case:

(Real gaussian) $c_1(n) \prod_{1 \le i < j \le n} (\lambda_i - \lambda_j) \prod_{i=1}^n \lambda_i^{-1/2} \exp\left(-\sum_{i=1}^n \lambda_i/2\right)$ (4)

(Complex gaussian) $c_2(n) \prod_{1 \le i < j \le n} |\lambda_i - \lambda_j|^2 \exp\left(-\sum_{i=1}^n \lambda_i/2\right)$. (5)

Here $c_1(n)$ and $c_2(n)$ are normalizing factors. It appears that this approach does not
extend to the case of more general R- or C-normalized models $\xi$, where the above
formulae are not available.

In the general setting (and in particular in discrete cases such as Bernoulli), it is
already not trivial to show that the probability that $\sigma_n(M_n(\xi))$ is positive tends to one
with $n$ (this statement is, of course, obvious in the continuous case, such as gaussian,
by a dimension argument). This was first done by Komlos [21], [22]. For more recent
developments along this line we refer to [20, 46, 6]. These papers give better and better
bounds on the rate of convergence (to one) of the probability in question, but do not
give any quantitative estimate on $\sigma_n$.

In the last few years, we have seen considerable progress in the problem of estimating
$\sigma_n$ and its tail distribution. In [33], the present authors proved an (almost sure) lower
bound for the absolute value of the determinant of a random Bernoulli matrix. As
the absolute value of the determinant is the product of the singular values, this result
implies an (almost sure) lower bound of the form $\exp(-n^{1/2+o(1)})$ for the least singular
value. A significant breakthrough was achieved by Rudelson [31], who established
a polynomial lower bound for $\sigma_n(M_n(\xi))$ and also tail estimates for a certain range.
Rudelson's results were then extended by several authors [33], [3B], [H], [13], [17], [IE], 
[4*9] . using the machinery of Inverse Littlewood-Offord theorems, introduced in |44j . 
For instance, under the assumption of bounded fourth moment $\mathbf{E}|\xi|^4 < \infty$, it was
shown in [36] that

$$\mathbf{P}(n\sigma_n(M_n(\xi))^2 \le t) \le f(t) + o(1)$$

for all fixed $t > 0$, where $f(t)$ goes to zero as $t \to 0$; similarly, in [33] it was shown that

$$\mathbf{P}(n\sigma_n(M_n(\xi))^2 \ge t) \le g(t) + o(1)$$

for all fixed $t > 0$, where $g(t)$ goes to zero as $t \to \infty$. Under the stronger assumption
that $\xi$ is subgaussian, the lower tail estimate was improved in [36] to

$$\mathbf{P}(n\sigma_n(M_n(\xi))^2 \le t) \le C t^{1/2} + c^n \qquad (6)$$

for some constants $C > 0$ and $0 < c < 1$ depending only on the subgaussian moments
of $\xi$. At the other extreme, with no moment assumptions on $\xi$, the bound

$$\mathbf{P}(n\sigma_n(M_n(\xi))^2 \le n^{-1-\frac{5}{2}A-A^2}) \le n^{-A+o(1)}$$

was shown for any fixed $A > 0$ in [48].

A common feature of the above mentioned results is that they give good upper and
lower tail bounds on $n\sigma_n(M_n(\xi))^2$, but not the distributional law. In fact, many papers
[33| I3~6l 143] are partially motivated by the following conjecture of Spielman and Teng 
[ffil Conjecture 2]. 
Conjecture 1.2. Let $\xi$ be the Bernoulli random variable. Then there is a constant
$0 < c < 1$ such that for all $t \ge 0$

$$\mathbf{P}(\sqrt{n}\,\sigma_n(M_n(\xi)) \le t) \le t + c^n. \qquad (7)$$

In this paper, we introduce a new method to study small singular values. This method
is analytic in nature and enables us to prove the universality of the limiting distribution
of $n\sigma_n(M_n(\xi))^2$.

Theorem 1.3 (Universality for the least singular value). Let $\xi$ be R- or C-normalized,
and suppose $\mathbf{E}|\xi|^{C_0} < \infty$ for some sufficiently large absolute constant $C_0$. Then for all
$t > 0$, we have

$$\mathbf{P}(n\sigma_n(M_n(\xi))^2 \le t) = \int_0^t \frac{1+\sqrt{x}}{2\sqrt{x}} e^{-(x/2+\sqrt{x})}\, dx + O(n^{-c}) \qquad (8)$$

if $\xi$ is R-normalized, and

$$\mathbf{P}(n\sigma_n(M_n(\xi))^2 \le t) = \int_0^t e^{-x}\, dx + O(n^{-c})$$

if $\xi$ is C-normalized, where $c > 0$ is an absolute constant. The implied constants in the
$O(\cdot)$ notation depend on $\mathbf{E}|\xi|^{C_0}$ but are uniform in $t$.

Figure 1 shows an empirical demonstration of the theorem above for Bernoulli and
for gaussian distributions.
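An experiment of this kind can be sketched in a few lines. The code below is only an illustrative sketch, not the authors' simulation: the helper names, the matrix size, the number of trials and the use of NumPy are all assumptions made here. It samples $n\sigma_n(M_n(\xi))^2$ for Bernoulli and gaussian entries and measures the Kolmogorov-Smirnov distance to the limiting law $1 - e^{-t/2-\sqrt{t}}$ in the R-normalized case of Theorem 1.3.

```python
import numpy as np


def scaled_least_singular_values(n, trials, sampler, rng):
    """Return `trials` samples of n * sigma_n(M)^2 for random n x n matrices M."""
    out = np.empty(trials)
    for k in range(trials):
        M = sampler(rng, (n, n))
        # np.linalg.svd returns singular values in descending order.
        out[k] = n * np.linalg.svd(M, compute_uv=False)[-1] ** 2
    return out


def bernoulli(rng, shape):
    """Entries equal to +1 or -1 with probability 1/2 each."""
    return rng.choice([-1.0, 1.0], size=shape)


def gaussian(rng, shape):
    """Entries distributed as N(0, 1)."""
    return rng.standard_normal(shape)


def limit_cdf(t):
    """Limiting law in the R-normalized case: 1 - exp(-t/2 - sqrt(t))."""
    t = np.asarray(t, dtype=float)
    return 1.0 - np.exp(-t / 2.0 - np.sqrt(t))


def ks_distance(samples, cdf):
    """Kolmogorov-Smirnov distance between the empirical CDF and `cdf`."""
    x = np.sort(samples)
    m = len(x)
    F = cdf(x)
    above = np.arange(1, m + 1) / m - F
    below = F - np.arange(0, m) / m
    return float(max(above.max(), below.max()))
```

With a few hundred trials the two empirical distances are dominated by sampling noise, so they indicate only rough agreement with the limit; they say nothing about the rate $O(n^{-c})$.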

Very roughly speaking, we will show that one can swap $\xi$ with the appropriate gaussian
distribution $g_{\mathbf{R}}$ or $g_{\mathbf{C}}$, at which point one can basically apply Theorem 1.1 as a
black box. In other words, we show that the law of $n\sigma_n(M_n(\xi))^2$ is universal with
respect to the choice of $\xi$ by a direct comparison to the gaussian models. The exact
formulae $\int_0^t \frac{1+\sqrt{x}}{2\sqrt{x}} e^{-(x/2+\sqrt{x})}\, dx$ and $\int_0^t e^{-x}\, dx$ do not play any important role. This
direct comparison (or coupling) approach is in the spirit of Lindeberg's proof [23] of
the central limit theorem, and was also recently applied in our proof of the circular law

m\- 

Our arguments are completely effective, and give an explicit value for $C_0$; for instance,
$C_0 := 10^4$ certainly suffices. Clearly, one should be able to lower $C_0$ significantly (we
have made no attempt to optimize in $C_0$, in order to simplify the exposition), but we
will not explore this issue here.

Figure 1. Plotted above is the curve $\mathbf{P}(\sqrt{n}\,\sigma_n(M_n(\xi)) \le x)$, based
on data from 1000 randomly generated matrices with $n = 100$. The
dotted curve was generated with $\xi$ a random Bernoulli variable, taking
the values $+1$ and $-1$ each with probability 1/2; and the dashed curve
was generated with $\xi$ a gaussian normal random variable. Note that the
two curves are already close together in spite of the relatively coarse
data.

Theorem 1.3 can be extended in several directions, with simple modifications of the
proof. For example, we can prove a similar universality result involving the joint
distribution of the bottom $k$ singular values of $M_n(\xi)$, for bounded $k$ (and even some results
when $k$ is a small power of $n$). Next, we can also consider rectangular matrices where
the difference between the two dimensions is not too large. Finally, all results hold if
we drop the condition that the entries have identical distribution. (It is important that
they are all normalized, independent and their $C_0$-th moments are uniformly bounded.)
For precise statements, see Section El 

It is clear that one can use Theorem 1.3 to address Conjecture 1.2. By Theorem 1.3
and (2), the left hand side of (7) is

$$\mathbf{P}(n\sigma_n(M_n(\xi))^2 \le t^2) = \int_0^{t^2} \frac{1+\sqrt{x}}{2\sqrt{x}} e^{-(x/2+\sqrt{x})}\, dx + O(n^{-c}) = 1 - e^{-t^2/2-t} + O(n^{-c}).$$

Since $1 - e^{-t^2/2-t} < t$ for any $t > 0$, we conclude that Conjecture 1.2 holds for any
$t > n^{-c_0}$, for some positive constant $c_0$. More importantly, it shows that the main term
$t$ on the right hand side of (7) is only a (first order) approximation of the truth and
could certainly be improved. For example, for sufficiently small $t$, Taylor expansion
gives

$$1 - e^{-t^2/2-t} = t - \frac{t^3}{3} + O(t^4).$$

Figure 2 provides empirical evidence that $\mathbf{P}(\sqrt{n}\,\sigma_n(M_n(\xi)) \le t) < t$ for the Bernoulli
and gaussian cases.

Figure 2. Plotted above is the curve $\mathbf{P}(\sqrt{n}\,\sigma_n(M_n(\xi)) \le x)$, based on
data from 150,000 randomly generated matrices with $n = 100$. The
curve with long dashes is the line $y = x$, and the solid curve is a plot
of $y = x - x^3/3$. The dotted curve was generated with $\xi$ a random
Bernoulli variable, taking the values $+1$ and $-1$ each with probability
1/2; and the dashed curve with spaces between the dashes was generated
with $\xi$ a gaussian normal random variable. Note that the dotted curve
and the dashed-with-spaces curve are completely overlapping, so that it
appears to be a single curve composed of dashes with dots in between.
On the right side of the graph, the overlapping curves from the least
singular values are distinctly lower than the line $y = x$ and are very close
to the curve $y = x - x^3/3$.
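The two elementary facts used in this discussion, namely that $1 - e^{-t^2/2-t}$ stays below $t$ and that it is close to $t - t^3/3$ for small $t$, can be checked numerically. The sketch below is purely illustrative; the function names are introduced here and are not from the paper.

```python
import math


def limit_cdf_sqrt_scale(t):
    """Limit of P(sqrt(n) * sigma_n <= t) in the real case: 1 - exp(-t^2/2 - t)."""
    # -expm1(-u) computes 1 - exp(-u) accurately for small u.
    return -math.expm1(-(t * t / 2.0 + t))


def cubic_approximation(t):
    """The curve y = t - t^3/3 plotted in Figure 2."""
    return t - t ** 3 / 3.0


def remainder(t):
    """Difference between the limit law and its cubic approximation."""
    return limit_cdf_sqrt_scale(t) - cubic_approximation(t)
```

For small `t` the value of `remainder(t)` is of order `t**4`, consistent with the Taylor expansion, while `limit_cdf_sqrt_scale(t) < t` holds on any grid of positive `t` one cares to try.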
The rest of the paper is organized as follows. In the next section, we describe our
proof strategy. In Section 3 we turn this high-level strategy into a rigorous proof,
using many technical lemmas from various areas of mathematics (linear algebra,
theoretical computer science, probability and high dimensional geometry). Many of these
lemmas may have some independent interest. For instance, Corollary 5.2 shows that
the distance from a random vector (in $\mathbf{C}^n$) to a hyperplane spanned by $n - 1$ other
random vectors has (asymptotically) gaussian distribution. This is obvious in the case
when the coordinates of the vector in question are iid gaussian, but is not so when they
are Bernoulli. (See also [36] for some related results in this spirit.)

Notation. We consider $n$ as an asymptotic parameter tending to infinity. We use
$X \ll Y$, $Y \gg X$, $Y = \Omega(X)$, or $X = O(Y)$ to denote the bound $X \le CY$ for all
sufficiently large $n$ and for some $C$ which can depend on fixed parameters (such as $C_0$
or $\mathbf{E}|\xi|^{C_0}$) but is independent of $n$.

The Frobenius norm $\|A\|_F$ of a matrix is defined as $\|A\|_F = \mathrm{trace}(AA^*)^{1/2}$. Note that
this bounds the operator norm $\|A\|_{op} := \sup\{|Ax| : |x| = 1\}$ of the same matrix.

For a random variable $X$, $\mathbf{E}(X)$ is the expectation of $X$. If $X$ is real, we denote by
$\mathbf{M}(X)$ its median, namely a number $x$ such that both $\mathbf{P}(X \ge x)$ and $\mathbf{P}(X \le x)$ are at
least 1/2. (If $x$ is not unique, choose one arbitrarily.)

For an event $\mathcal{E}$, $\mathbf{I}_{\mathcal{E}}$ is its indicator function, taking value 1 if $\mathcal{E}$ holds and 0 otherwise.
Clearly $\mathbf{E}(\mathbf{I}_{\mathcal{E}}) = \mathbf{P}(\mathcal{E})$.

2. The main idea and the proof strategy 

We now discuss our main idea and the strategy behind Theorem 1.3. A formal version
of this argument is given in Section 3, though for various minor technical reasons, the
presentation there will be rearranged slightly from the one given here.

Let us first reveal our main idea. To start, we are going to view $\sigma_n(M_n(\xi))$ as the
(reciprocal of the) largest singular value of $M_n(\xi)^{-1}$. One of the most popular methods
to study the largest singular value is the moment method, which enables one to control
$\sigma_1(M)$ (of a random matrix $M$) if one has good estimates on $\mathrm{trace}(MM^*)^{k/2}$
for very large $k$. The moment method was used successfully by Soshnikov [39] to study
$\sigma_1(M_n(\xi))$. However, it is important in the applications of this method that the entries
of $M$ are independent and their distributions well-understood. Unfortunately, the
entries of $M^{-1}$ are highly correlated and not much is known about their distributions.
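The linear algebra behind the moment method is the elementary sandwich $\sigma_1(M)^k \le \mathrm{trace}(MM^*)^{k/2} \le n\,\sigma_1(M)^k$ for even $k$, so the $k$-th root of the trace pins down $\sigma_1(M)$ up to a factor $n^{1/k}$. The snippet below is a small illustrative check of that deterministic fact only (the function name is introduced here); it does not attempt the probabilistic estimates of the trace that the method actually requires.

```python
import numpy as np


def moment_estimate(M, k):
    """Return (trace (M M^*)^{k/2})^{1/k} for an even integer k >= 2.

    This is the l^k norm of the singular values of M, which lies between
    sigma_1(M) and n^{1/k} * sigma_1(M).
    """
    if k < 2 or k % 2 != 0:
        raise ValueError("k must be an even integer >= 2")
    gram = M @ M.conj().T
    power = np.linalg.matrix_power(gram, k // 2)
    return float(np.trace(power).real ** (1.0 / k))
```

As `k` grows, `moment_estimate(M, k)` decreases towards the largest singular value, which is why very large `k` is needed for sharp control.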

We have found a new approach, motivated by the general idea of "property testing", a
topic popular in theoretical computer science and combinatorics. The general setting
of a property testing problem is as follows. Given a large, complex structure $S$, we
would like to study some parameter $P$ of $S$. It has been observed that quite often
one can obtain good estimates about $P$ by just looking at a small substructure of
$S$, sampled randomly. In our situation, the large structure is the matrix $S := M_n^{-1}$,
and the parameter in question is its largest singular value. It turns out that
this largest singular value can be estimated quite precisely (with high probability) by
sampling a few rows (say $s$) from $S$ and considering the submatrix $S'$ formed by these
rows. The heart of the proof then consists of two observations: (1) the singular values
of $S'$ can be computed from a matrix obtained by projecting the rows of $S^{-1} = M_n$
onto a subspace of dimension $s$, and (2) such a projection has a central limit theorem
effect. All these together allow us to compare the least singular value of $M_n(\xi)$ with
the least singular value of $M_s(g)$ (properly normalized), proving the universality.

Let us now be a little more specific. For the sake of discussion we consider the R-normalized
case (8). Our goal is to show

$$\sigma_n(M_n(\xi)) \approx \sigma_n(M_n(g_{\mathbf{R}})) \qquad (9)$$

where we will be deliberately vague$^1$ as to what the symbol $\approx$ means in this non-rigorous
discussion. The C-normalized case will of course be very similar.

We have 

$$\sigma_n(A) = \sigma_1(A^{-1})^{-1}. \qquad (10)$$

(From existing results, e.g. [20, HE], it is known that M n (£) is invertible with very high 
probability.) We now wish to show that 

$$\sigma_1(M_n(\xi)^{-1}) \approx \sigma_1(M_n(g_{\mathbf{R}})^{-1}).$$

Let $R_1(\xi), \ldots, R_n(\xi)$ denote the rows of $M_n(\xi)^{-1}$, and let $s := \lfloor n^{\varepsilon} \rfloor$ for some small
absolute constant $\varepsilon > 0$. To estimate the largest singular value of the $n \times n$ matrix
formed by the $n$ rows $R_1(\xi), \ldots, R_n(\xi)$, we use random sampling. More precisely,
we create the $s \times n$ submatrix $B_{s,n}(\xi)$ formed by selecting $s$ of these rows at random.
Actually, since the joint distribution of $R_1(\xi), \ldots, R_n(\xi)$ is easily seen to be invariant
under relabeling of the indices, we may just take $B_{s,n}(\xi)$ to be the matrix with rows
$R_1(\xi), \ldots, R_s(\xi)$. An application of the second moment method (see Lemma 3.1 and
its proof) will give us the relationship

$$\sigma_1(M_n(\xi)^{-1})^2 \approx \frac{n}{s}\,\sigma_1(B_{s,n}(\xi))^2 \qquad (11)$$

provided that we have some reasonable bound on the magnitude of the rows $R_1(\xi), \ldots, R_n(\xi)$
(see Proposition 3.2); of course we expect the same statement to be true with $\xi$
replaced by $g_{\mathbf{R}}$. Assuming it for now, we are now (morally) reduced to establishing a
relationship of the form

$$\sigma_1(B_{s,n}(\xi)) \approx \sigma_1(B_{s,n}(g_{\mathbf{R}})).$$

The next step is to use some elementary linear algebra to replace the $s \times n$ matrix
$B_{s,n}(\xi)$ with an $s \times s$ matrix $M_{s,n}(\xi)$, defined as follows. Let $X_1(\xi), \ldots, X_n(\xi) \in \mathbf{R}^n$
be the columns of $M_n(\xi)$; observe that these iid random variables are the dual basis of
$R_1(\xi), \ldots, R_n(\xi)$, thus $R_i(\xi) \cdot X_j(\xi) = \delta_{ij}$ for $1 \le i, j \le n$, where $\delta_{ij}$ is the Kronecker
delta.

$^1$ Roughly speaking, $X \approx Y$ means that the distributions of the random variables $X$ and $Y$ are close
in some appropriately normalized Levy distance, where the appropriate normalization may change
from line to line.

Let $V_{s,n}(\xi)$ be the $s$-dimensional subspace of $\mathbf{R}^n$ defined as the orthogonal complement
of the $(n-s)$-dimensional space spanned by the columns $X_{s+1}(\xi), \ldots, X_n(\xi)$. We
select an orthonormal basis on $V_{s,n}(\xi)$ arbitrarily (e.g. uniformly at random, and
independently of the columns $X_1(\xi), \ldots, X_s(\xi)$), thus identifying $V_{s,n}(\xi)$ with the standard
$s$-dimensional space $\mathbf{R}^s$. The orthogonal projection from $\mathbf{R}^n$ to $V_{s,n}(\xi)$ can now be
thought of as a partial isometry $\pi : \mathbf{R}^n \to \mathbf{R}^s$. Let $M_{s,n}(\xi)$ be the $s \times s$ matrix whose
columns are $\pi(X_1(\xi)), \ldots, \pi(X_s(\xi))$. In Lemma 3.3 we will establish the simple identity

$$\sigma_1(B_{s,n}(\xi)) = \sigma_s(M_{s,n}(\xi))^{-1},$$

thus reducing our task to that of showing that

$$\sigma_s(M_{s,n}(\xi)) \approx \sigma_s(M_{s,n}(g_{\mathbf{R}})). \qquad (12)$$

The relation (12) looks very similar to our original relation (9) (indeed, when $s = n$,
(12) collapses back to (9)). However, the critical gain here is that (12) becomes much
easier to prove than (9) when $s$ is only a small power of $n$, because the projection $\pi$
will act to average out the random variable $\xi$ into (approximately) a gaussian variable
(essentially thanks to the central limit theorem). Indeed, by using a variant of
the Berry-Esseen central limit theorem (Proposition 3.4), together with some
non-degeneracy (or "delocalization") properties of $V_{s,n}(\xi)$ (Proposition 3.5), we will be able
to establish a relation of the form

$$M_{s,n}(\xi) \approx M_s(g_{\mathbf{R}})$$

and similarly

$$M_{s,n}(g_{\mathbf{R}}) \approx M_s(g_{\mathbf{R}})$$

(indeed, it is not hard to see that $M_{s,n}(g_{\mathbf{R}})$ and $M_s(g_{\mathbf{R}})$ in fact have an identical
distribution). The claim (12) will then follow from the Lipschitz properties of $\sigma_s$,
which is a consequence of the Hoffman-Wielandt theorem (see Lemma B.1).
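The Lipschitz property invoked here can be illustrated numerically. The sketch below assumes the standard forms of the perturbation bounds for singular values, namely $\max_j |\sigma_j(A) - \sigma_j(B)| \le \|A - B\|_{op}$ and $\sum_j |\sigma_j(A) - \sigma_j(B)|^2 \le \|A - B\|_F^2$; whether the lemma cited in the text is stated in exactly this form is an assumption. The function name is hypothetical.

```python
import numpy as np


def singular_value_gaps(A, B):
    """Return (max_j |s_j(A) - s_j(B)|, l2 norm of the vector of differences).

    Singular values of both matrices are taken in descending order, so the
    j-th entries are compared with each other.
    """
    sa = np.linalg.svd(A, compute_uv=False)
    sb = np.linalg.svd(B, compute_uv=False)
    diff = sa - sb
    return float(np.abs(diff).max()), float(np.sqrt((diff ** 2).sum()))
```

In particular the least singular value moves by at most the operator norm of the perturbation, which is how a small entrywise change in an $s \times s$ matrix translates into a small change in $\sigma_s$.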



3. The rigorous proof 

We now implement the strategy sketched out in Section 2 to give a rigorous proof of
Theorem 1.3. This proof will rely on several key propositions which are proven in later
sections or in the appendices.

Let $F$ be either the real field $\mathbf{R}$ or the complex field $\mathbf{C}$, and fix an $F$-normalized
random variable $\xi$. We assume $C_0$ to be a sufficiently large constant to be chosen later
(e.g. $C_0 = 10^4$ certainly suffices). All implied constants are allowed to depend on $C_0$
and $\mathbf{E}|\xi|^{C_0}$. In all arguments, we assume $n$ to be large depending on these parameters.
Fix $t > 0$, and write $f(x) := \frac{1+\sqrt{x}}{2\sqrt{x}} e^{-(x/2+\sqrt{x})}$ if $F = \mathbf{R}$ or $f(x) := e^{-x}$ if $F = \mathbf{C}$. Our
task is to show that

$$\mathbf{P}(n\sigma_n(M_n(\xi))^2 \le t) = \int_0^t f(x)\, dx + O(n^{-c}). \qquad (13)$$

We first make a simple reduction. By hypothesis, $\mathbf{E}|\xi|^{C_0} = O(1)$. Hence by Markov's
inequality, we see that $\mathbf{P}(|\xi| \ge n^{10/C_0}) = O(n^{-10})$. Thus, by the union bound, we see
that with probability at least $1 - O(n^{-8})$, all coefficients of $M_n(\xi)$ are $O(n^{10/C_0})$. If
we remove the tail event $|\xi| > n^{10/C_0}$ from $\xi$, and readjust $\xi$ slightly to restore the
$F$-normalization conditions (using (36) to absorb the error, and using the continuity
of $f$), we may thus reduce to the case when

$$|\xi| \le n^{10/C_0} \qquad (14)$$

with probability one. This is a standard truncation and re-normalization process, used
frequently in random matrix literature (see, for instance, [5]). We omit the (routine,
but somewhat tedious) details.

For minor technical reasons, it is also convenient to assume $\xi$ to be a continuous
random variable (in particular, this implies that $M_n(\xi)$ is invertible with probability
one). However, we emphasise that the bounds in our arguments do not explicitly
depend on the continuity properties of $\xi$, and instead depend only on the $C_0$-th moment
of $\xi$ for any fixed $n$. Since one can express any discrete random variable with finite $C_0$-th
moment (and obeying (14)) as the limit of a sequence of continuous random variables
with uniformly bounded $C_0$-th moment (and also obeying (14)), we see that the discrete
case of the theorem can be recovered from the continuous one by a standard limiting
argument (keeping $n$ fixed during this process, and using (36) as necessary).

Applying (10), we have

$$\mathbf{P}(n\sigma_n(M_n(\xi))^2 \le t) = \mathbf{P}(\sigma_1(M_n(\xi)^{-1})^2 \ge n/t).$$

Let $R_1(\xi), \ldots, R_n(\xi)$ denote the rows of $M_n(\xi)^{-1}$. Since the columns $X_1(\xi), \ldots, X_n(\xi)$
of $M_n(\xi)$ are exchangeable (i.e. exchanging any two columns of $M_n(\xi)$ does not affect
the distribution), we see that $M_n(\xi)^{-1}$ is row-exchangeable.

Our first step is motivated by the observation that in certain cases the largest singular 
values of a matrix can be well approximated by sampling. This fact is well-known in 
theoretical computer science and numerical analysis. In particular, the lemma below 
is a special case of more general results from [TU [7j. 

Lemma 3.1 (Random sampling). Let $1 \le s \le n$ be integers. Let $A$ be an $n \times n$ real or
complex matrix with rows $R_1, \ldots, R_n$. Let $k_1, \ldots, k_s \in \{1, \ldots, n\}$ be selected
independently and uniformly at random, and let $B$ be the $s \times n$ matrix with rows $R_{k_1}, \ldots, R_{k_s}$.
Then

$$\mathbf{E}\left\|A^*A - \frac{n}{s} B^*B\right\|_F^2 \le \frac{n}{s} \sum_{k=1}^n |R_k|^4.$$

The proof of Lemma 3.1 is presented in the appendices.
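The sampling bound in Lemma 3.1 can be probed by a small Monte Carlo experiment. The sketch below is illustrative only and checks the inequality for one fixed matrix by averaging over random row selections; the helper names are assumptions introduced here, and rows are drawn with replacement, matching the independent uniform choice of $k_1, \ldots, k_s$.

```python
import numpy as np


def sampling_error_sq(A, s, rng):
    """One sample of ||A^* A - (n/s) B^* B||_F^2 for s uniformly random rows."""
    n = A.shape[0]
    idx = rng.integers(0, n, size=s)  # independent, uniform, with replacement
    B = A[idx, :]
    D = A.conj().T @ A - (n / s) * (B.conj().T @ B)
    return float(np.linalg.norm(D, 'fro') ** 2)


def lemma_bound(A, s):
    """Right-hand side of Lemma 3.1: (n/s) * sum_k |R_k|^4."""
    n = A.shape[0]
    row_norms = np.linalg.norm(A, axis=1)
    return float((n / s) * np.sum(row_norms ** 4))


def mean_sampling_error_sq(A, s, trials, rng):
    """Monte Carlo estimate of the expectation on the left-hand side."""
    return float(np.mean([sampling_error_sq(A, s, rng) for _ in range(trials)]))
```

The bound is useful precisely when no single row carries much of the mass of $A$, which is why the argument needs the row bound of Proposition 3.2 for $A = M_n(\xi)^{-1}$.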

In order to apply Lemma 3.1 we need to bound the right hand side. This is done in
the following proposition.

Proposition 3.2 (Tail bound on $|R_i(\xi)|$). Let $R_1, \ldots, R_n$ be the rows of $M_n(\xi)^{-1}$.
Then

$$\mathbf{P}\left(\max_{1 \le i \le n} |R_i(\xi)| \ge n^{100/C_0}\right) \ll n^{-1/C_0}.$$

We will prove this important proposition in Section 5.
To continue, let $\mathcal{E}_1$ denote the event

$$\max_{1 \le i \le n} |R_i(\xi)| \le n^{100/C_0}. \qquad (15)$$

By Proposition 3.2 we have

$$\mathbf{P}(\mathcal{E}_1) \ge 1 - O(n^{-1/C_0}). \qquad (16)$$

Set

$$s := \lfloor n^{500/C_0} \rfloor. \qquad (17)$$



Sample a matrix $A$ from the distribution $M_n(\xi)$. Let $B$ be the submatrix formed by
$s$ random rows of $A^{-1}$. If the rows of $A^{-1}$ satisfy $\mathcal{E}_1$, then by Lemma 3.1 and the
definition of $s$, we have

$$\mathbf{E}\left(\left|\sigma_1(A^{-1})^2 - \frac{n}{s}\sigma_1(B)^2\right|^2\right) \le n^{-100/C_0} n^2.$$

By Markov's inequality,

$$\mathbf{P}\left(\left|\sigma_1(A^{-1})^2 - \frac{n}{s}\sigma_1(B)^2\right| \ge n^{-40/C_0} n\right) \le n^{-20/C_0}.$$

(The expectation and probability in the last two estimates are with respect to the
random choice of the rows; the matrix $A$ is fixed and satisfies $\mathcal{E}_1$.)

Let $\mathcal{E}_2$ be the event that

$$\left|\sigma_1(A^{-1})^2 - \frac{n}{s}\sigma_1(B)^2\right| \le n^{-40/C_0} n.$$

Finally, let $\mathcal{E}_3$ be the event that

$$\frac{n}{s}\sigma_1(B)^2 \le n(t^{-1} - n^{-40/C_0}).$$

We are going to view both $\mathcal{E}_2$, $\mathcal{E}_3$ as events in the product space generated by $M_n(\xi)$
and the random choice of the rows. A simple calculation shows that if $\mathcal{E}_1, \mathcal{E}_2, \mathcal{E}_3$ hold,
then

$$n\sigma_n(A)^2 \ge t.$$

It follows that

$$\mathbf{P}(n\sigma_n(M_n(\xi))^2 \ge t) \ge \mathbf{P}\left(\frac{n}{s}\sigma_1(B)^2 \le n(t^{-1} - n^{-40/C_0})\right) - O(n^{-1/C_0}).$$

Arguing similarly, we have

$$\mathbf{P}(n\sigma_n(M_n(\xi))^2 \le t) \ge \mathbf{P}\left(\frac{n}{s}\sigma_1(B)^2 \ge n(t^{-1} + n^{-40/C_0})\right) - O(n^{-1/C_0}).$$

Our next tool is a linear algebraic lemma that connects the submatrices of $A^{-1}$ to
matrices obtained by projecting the row vectors of $A$ onto a subspace.

Lemma 3.3 (Projection lemma). Let $1 \le s \le n$ be integers, let $F$ be the real or complex
field, let $A$ be an $n \times n$ $F$-valued invertible matrix with columns $X_1, \ldots, X_n$, and let
$R_1, \ldots, R_n$ denote the rows of $A^{-1}$. Let $B$ be the $s \times n$ matrix with rows $R_1, \ldots, R_s$.
Let $V$ be the $s$-dimensional subspace of $F^n$ formed as the orthogonal complement of
the span of $X_{s+1}, \ldots, X_n$, which we identify with $F^s$ via an orthonormal basis, and let
$\pi : F^n \to F^s$ be the orthogonal projection to $V \equiv F^s$. Let $M$ be the $s \times s$ matrix with
columns $\pi(X_1), \ldots, \pi(X_s)$. Then $M$ is invertible, and we have

$$BB^* = M^{-1}(M^{-1})^*.$$

In particular, we have

$$\sigma_j(B) = \sigma_{s-j+1}(M)^{-1}$$

for all $1 \le j \le s$.

We prove this lemma in Appendix B.
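The identity in Lemma 3.3 is easy to test numerically in the real case. The following sketch is an illustration under stated assumptions, not the construction used in the paper's proof: the function name is hypothetical, the orthonormal basis of $V$ is obtained from a complete QR factorization (one of many admissible choices), and the code assumes $1 \le s < n$ and that $A$ is real and invertible.

```python
import numpy as np


def projected_matrix(A, s):
    """Return the s x s matrix M of Lemma 3.3 for a real invertible A.

    V is the orthogonal complement of the span of the last n - s columns of A;
    pi is the orthogonal projection onto V, expressed in an orthonormal basis.
    """
    n = A.shape[0]
    if not 1 <= s < n:
        raise ValueError("this sketch assumes 1 <= s < n")
    # Complete QR of the last n - s columns: the first n - s columns of Q span
    # span(X_{s+1}, ..., X_n); the remaining s columns span its complement V.
    Q, _ = np.linalg.qr(A[:, s:], mode="complete")
    basis = Q[:, n - s:]           # n x s, orthonormal basis of V
    return basis.T @ A[:, :s]      # columns are pi(X_1), ..., pi(X_s)
```

Comparing `np.linalg.inv(A)[:s, :]` (the matrix $B$) with the inverse of `projected_matrix(A, s)` then confirms both $BB^* = M^{-1}(M^{-1})^*$ and the reversal of singular values $\sigma_j(B) = \sigma_{s-j+1}(M)^{-1}$.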
For any fixed value of the rows $X_{s+1}, \ldots, X_n$, we choose a projection $\pi : F^n \to F^s$
as above. The exact choice of $\pi$ is not terribly important so long as it is made
independently of the rows $X_1, \ldots, X_s$; for instance, one could pick $\pi$ uniformly at random
with respect to the Haar measure on all such available projections, independently of
$X_1, \ldots, X_s$.

Applying Lemma 3.3 twice, and noting that the rows of $M_n(\xi)$ have identical
distribution, we can rewrite the inequalities in question as

$$\mathbf{P}(n\sigma_n(M_n(\xi))^2 \le t) \le \mathbf{P}(s\sigma_s(M_{s,n}(\xi))^2 \le t_+) + O(n^{-1/C_0})$$

and

$$\mathbf{P}(n\sigma_n(M_n(\xi))^2 \le t) \ge \mathbf{P}(s\sigma_s(M_{s,n}(\xi))^2 \le t_-) - O(n^{-1/C_0})$$

respectively, where

$$t_+ := \frac{1}{t^{-1} - n^{-40/C_0}}$$

(with the convention that $t_+ = \infty$ if $t > n^{40/C_0}$) and

$$t_- := \frac{1}{t^{-1} + n^{-40/C_0}}$$

and $M_{s,n}(\xi)$ is the $s \times s$ matrix with rows $\pi(X_1(\xi)), \ldots, \pi(X_s(\xi))$, and $\pi : F^n \to F^s$
is the orthogonal projection to the $s$-dimensional space $V = V_{s,n}(\xi)$ orthogonal to the
columns $X_{s+1}(\xi), \ldots, X_n(\xi)$ of $M_n(\xi)$, where we identify $V$ with $F^s$ via some orthonormal
basis of $V$, chosen in some fashion independent of the first $s$ columns $X_1(\xi), \ldots, X_s(\xi)$.

The final, and key, point in the proof is to show that the distribution of $M_{s,n}(\xi)$ is
very close to that of $M_{s,n}(g)$. We are going to need the following high-dimensional
generalization of the classical Berry-Esseen central limit theorem, which we will prove
in Appendix D.

Proposition 3.4 (Berry-Esseen-type central limit theorem for frames). Let $1 \le N \le n$,
let $F$ be the real or complex field, and let $\xi$ be $F$-normalized and have finite third
moment $\mathbf{E}|\xi|^3 < \infty$. Let $v_1, \ldots, v_n \in F^N$ be a normalized tight frame for $F^N$, or in
other words

$$v_1 v_1^* + \ldots + v_n v_n^* = I_N, \qquad (18)$$

where $I_N$ is the identity matrix on $F^N$. Let $S \in F^N$ denote the random variable

$$S = \xi_1 v_1 + \ldots + \xi_n v_n,$$

where $\xi_1, \ldots, \xi_n$ are iid copies of $\xi$. Similarly, let $G := (g_{F,1}, \ldots, g_{F,N}) \in F^N$ be formed
from $N$ iid copies of $g_F$. Then for any measurable set $\Omega \subset F^N$ and any $\varepsilon > 0$, one has

$$\mathbf{P}(G \in \Omega \setminus \partial_\varepsilon \Omega) - O\left(N^{5/2} \varepsilon^{-3} \max_{1 \le j \le n} |v_j|\right) \le \mathbf{P}(S \in \Omega) \le \mathbf{P}(G \in \Omega \cup \partial_\varepsilon \Omega) + O\left(N^{5/2} \varepsilon^{-3} \max_{1 \le j \le n} |v_j|\right)$$

where

$$\partial_\varepsilon \Omega := \{x \in F^N : \mathrm{dist}_\infty(x, \partial\Omega) \le \varepsilon\},$$

$\partial\Omega$ is the topological boundary of $\Omega$, and $\mathrm{dist}_\infty$ is the distance using the $\ell^\infty$ metric
on $F^N$. The implied constant depends on the third moment $\mathbf{E}|\xi|^3$ of $\xi$.
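To make the tight frame condition (18) concrete, here is a small illustrative sketch in the real case. It builds a frame from a matrix with orthonormal rows (one convenient construction assumed here, with hypothetical helper names) and samples $S = \xi_1 v_1 + \ldots + \xi_n v_n$ with Bernoulli $\xi_i$. It only checks that the frame identity holds and that $S$ has covariance close to $I_N$, as $G$ does; it does not test the quantitative bound of the proposition.

```python
import numpy as np


def random_tight_frame(N, n, rng):
    """Return an N x n real matrix whose columns v_1..v_n satisfy (18).

    The reduced QR factor Q of an n x N gaussian matrix has orthonormal
    columns, so Q.T has orthonormal rows and sum_j v_j v_j^T = I_N.
    """
    Q, _ = np.linalg.qr(rng.standard_normal((n, N)))
    return Q.T


def sample_frame_sums(frame, trials, rng):
    """Return a trials x N array of samples of S with Bernoulli coefficients."""
    n = frame.shape[1]
    xi = rng.choice([-1.0, 1.0], size=(trials, n))
    return xi @ frame.T
```

The size of the largest column norm, `np.linalg.norm(frame, axis=0).max()`, is the quantity that controls the error term in the proposition: the flatter the frame, the closer $S$ is to gaussian.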

In order to apply this result, we need to ensure non-degeneracy of the subspace $V =
V_{s,n}(\xi)$. This is done in the following proposition, which we will prove in Section 4.

Proposition 3.5 ($V$ is non-degenerate). Let $E_c(\xi)$ denote the event that there does
not exist a unit vector $v = (v_1, \ldots, v_n)$ in $V_{s,n}(\xi)$ such that $\max_{1 \le j \le n} |v_j| \ge n^{-c}$. If $c$ is
a sufficiently small absolute constant, then

$$\mathbf{P}(\overline{E_c(\xi)}) \ll \exp(-n^{\Omega(1)}).$$

An important point here is that $c$ does not depend on $C_0$. For instance, we will be
able to take $c := 1/20$.

The idea that normal vectors are non-degenerate was considered in [33] and has since 
then become an important part in several papers on the least singular value problem, 
e.g. [34|, [35], [36], [37], [33], [3H], [37J, [3H]- The notion of non-degeneracy here is, 
however, somewhat different from those considered before. In the above mentioned 
papers, it was typically required that no small set of coordinates (say $n^{0.99}$ coordinates)
contains most of the mass of the vector (in other words, one cannot compress the vector 
into a much shorter one). In contrast, we require here that no individual coordinate 
can contain a significant (but not overwhelming) portion of the mass. 

Let us take this proposition for granted for now, and condition on
$X_{s+1}(\xi), \ldots, X_n(\xi)$ so that $E_c(\xi)$ holds; note that $X_1(\xi), \ldots, X_s(\xi)$ remain iid
under this conditioning. We can then write

s n 

i=i j=i 

where the £y are iid copies of £, and is the s x s matrix with all rows vanishing 
except the i row, which is equal to 7r(e 3 -), where ej is the j th basis vector of F n . Since 
7r is a partial isometry, tttt* = I s , which implies that 

s n 
i=l j=l 

2 

where we view identify the space of s x s matrices with the vector space F s in the 
standard manner. 




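Now let $\Omega \subset F^{s^2}$ be the set of all $s \times s$ matrices $A$ such that $s\sigma_s(A)^2 \le t_+$; thus $s\sigma_s(M_{s,n}(\xi))^2 \le t_+$ if and only if $M_{s,n}(\xi) \in \Omega$.

Applying Proposition 3.4 (with $N := s^2$) and the definition of the event $E_c$, we thus have

$$\mathbf{P}\big(s\sigma_s(M_{s,n}(\xi))^2 \le t_+ \,\big|\, X_{s+1}(\xi),\dots,X_n(\xi)\big) \le \mathbf{P}(M_s(g_F) \in \Omega \cup \partial_\varepsilon\Omega) + O(s^5\varepsilon^{-3}n^{-c}),$$

where $\varepsilon$ is a parameter to be chosen later. But by Weyl's bound (36) and the crude bound

$$\|A\|_{op} \le \|A\|_F \le s \sup_{1\le i,j\le s}|a_{ij}|$$

for any $s \times s$ matrix $A = (a_{ij})_{1\le i,j\le s}$, we see that if $A$ is an $s \times s$ matrix in $\Omega \cup \partial_\varepsilon\Omega$, then

$$\sigma_s(A) \le \sqrt{t_+/s} + O(s\varepsilon),$$

and thus

$$\mathbf{P}\big(s\sigma_s(M_{s,n}(\xi))^2 \le t_+ \,\big|\, X_{s+1}(\xi),\dots,X_n(\xi)\big) \le \mathbf{P}\big(\sigma_s(M_s(g_F)) \le \sqrt{t_+/s} + O(s\varepsilon)\big) + O(s^5\varepsilon^{-3}n^{-c}).$$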

Setting $\varepsilon := n^{-c/4}$ (say), we see from the construction of $s$ that $s^5\varepsilon^{-3}n^{-c} = O(n^{-1/C_0})$ if $C_0$ is large enough depending on $c$. Also, we can absorb the $O(s\varepsilon)$ error into the main term by replacing $t_+$ by a slightly larger quantity, e.g.

$$t_{++} := \frac{1}{\max(t_+^{-1} - O(n^{-30/C_0}),\, 0)}.$$

Integrating over all $X_{s+1}(\xi),\dots,X_n(\xi)$ obeying $E_c$, and then using Proposition 3.5 and the above argument, we conclude that

$$\mathbf{P}(n\sigma_n(M_n(\xi))^2 \le t) \le \mathbf{P}(s\sigma_s(M_s(g_F))^2 \le t_{++}) + O(n^{-1/C_0}). \qquad (19)$$

A similar argument gives

$$\mathbf{P}(n\sigma_n(M_n(\xi))^2 \le t) \ge \mathbf{P}(s\sigma_s(M_s(g_F))^2 \le t_{--}) - O(n^{-1/C_0}), \qquad (20)$$

where

$$t_{--} := \frac{1}{\max(t_-^{-1} - O(n^{-30/C_0}),\, 0)}.$$



At this point we could use Theorem 1.1 and obtain a qualitative version of Theorem 1.3. This would result in error terms $o(1)$ rather than $O(n^{-c})$. However, one only needs a little more effort to obtain the full result. The estimates (19), (20) hold with $n$ replaced by $2n$; adjusting $t$ appropriately (as well as the implied constants defining $t_{--}, t_{++}$), one obtains the bounds

$$\mathbf{P}(n\sigma_n(M_n(\xi))^2 \le t) \le \mathbf{P}(2n\sigma_{2n}(M_{2n}(\xi))^2 \le t_{++}) + O(n^{-1/C_0}) \qquad (21)$$

and

$$\mathbf{P}(n\sigma_n(M_n(\xi))^2 \le t) \ge \mathbf{P}(2n\sigma_{2n}(M_{2n}(\xi))^2 \le t_{--}) - O(n^{-1/C_0}). \qquad (22)$$

This bound is established under the hypothesis ( ED , but as discussed at the beginning 
of the section, we may easily remove this hypothesis. In particular, the above bounds 
now hold for the gaussian distribution $\xi = g_F$. Meanwhile, from Theorem 1.1 we have

$$\lim_{n\to\infty} \mathbf{P}(n\sigma_n(M_n(g_F))^2 \le t) = \int_0^t f(x)\,dx. \qquad (23)$$

Iterating$^{2}$ (21), (22) and using (23) to pass to the limit (and adjusting the implied constants in the definition of $t_{--}, t_{++}$ again), we conclude that

$$\mathbf{P}(n\sigma_n(M_n(g_F))^2 \le t) \le \int_0^{t_{++}} f(x)\,dx + O(n^{-1/C_0})$$

and

$$\mathbf{P}(n\sigma_n(M_n(g_F))^2 \le t) \ge \int_0^{t_{--}} f(x)\,dx - O(n^{-1/C_0}).$$

Since $f$ is absolutely integrable and decays exponentially at infinity, we thus have

$$\mathbf{P}(n\sigma_n(M_n(g_F))^2 \le t) = \int_0^{t} f(x)\,dx + O(n^{-1/C_0}).$$

Inserting this bound (with $n$ replaced by $s$) into (19), (20) and using the continuity of $f$ again, we obtain (13) and Theorem 1.3.

Remark 3.6. The above argument used the explicit limiting statistics for the hard edge of the gaussian random matrix spectrum in Theorem 1.1. If one refused to use this theorem, the best one could say using the above argument is that there exists a random variable $X_F$ taking values in the positive real line and independent of $n$ and $\xi$ such that

$$\mathbf{P}(X_F \le t - n^{-c}) - n^{-c} \le \mathbf{P}(n\sigma_n(M_n(\xi))^2 \le t) \le \mathbf{P}(X_F \le t + n^{-c}) + n^{-c}$$

for any $t > 0$. (We can replace $O(n^{-c})$ by $n^{-c}$ by slightly changing the value of $c$.) In other words, the random variables $n\sigma_n(M_n(\xi))^2$ converge at some polynomial rate with respect to the Lévy metric to a universal limit $X_F$ depending only on the field $F$. Theorem 1.1 can then be viewed as a computation of what this universal limit is.

$^{2}$One way to interpret this iteration is to view (21), (22) as asserting that the random variables $n\sigma_n(M_n(\xi))^2$ form a Cauchy sequence in the Lévy metric.
4. Non-degeneracy of normal vectors

In this section we prove Proposition 3.5. Let $c > 0$ be a sufficiently small constant. By the union bound and symmetry, it suffices to show that

$$\mathbf{P}\big(|v_1| \ge n^{-c} \text{ for some } v = (v_1,\dots,v_n) \in V_{s,n},\ |v| = 1\big) \le \exp(-n^{\Omega(1)}). \qquad (24)$$

Let $v = (v_1,\dots,v_n)$ be such that the event in (24) occurs. Since $v \in V_{s,n}$, we have

$$Bv = 0,$$

where $B$ is the $(n-s) \times n$ matrix with rows $X_{s+1},\dots,X_n$. We write $B = (Y_1, B')$, where $Y_1 \in F^{n-s}$ is a column vector whose entries are iid copies of $\xi$, and $B'$ is an $(n-s) \times (n-1)$ matrix whose entries are also iid copies of $\xi$. We similarly write $v = (v_1, v')$, where $v' := (v_2,\dots,v_n) \in F^{n-1}$. Then we have

$$Y_1 v_1 + B'v' = 0. \qquad (25)$$



Our first tool here is the following lemma, which asserts that with overwhelming probability, a random matrix has many small singular values. We will prove this lemma in Appendix F.

Lemma 4.1 (Many small singular values). Let $n \ge 1$, and let $\xi$ be $\mathbf{R}$-normalized or $\mathbf{C}$-normalized, such that $|\xi| \le n^{\varepsilon}$ almost surely for some sufficiently small absolute constant $\varepsilon > 0$. Then there are positive constants $c, c'$ such that with probability $1 - \exp(-n^{\Omega(1)})$, $M_n(\xi)$ has at least $c' n^{1-c}$ singular values in the interval $[0, n^{1/2-c}]$.



The next tool has a geometric flavor. It asserts that the orthogonal projection of a 
random vector onto a large subspace is strongly concentrated. 

Lemma 4.2 (Concentration of the projection of a random vector). Let $n \ge d \ge 1$ be integers, let $K \ge 1$, let $F$ be the real or complex field, let $\xi$ be $F$-normalized with $|\xi| \le K$ almost surely, and let $V$ be a subspace of $F^n$ of dimension $d$. Let $X := (\xi_1,\dots,\xi_n)$, where $\xi_1,\dots,\xi_n$ are iid copies of $\xi$. Suppose that $d \ge CK^2$ for some sufficiently large absolute constant $C$. Then

$$\mathbf{P}\big(\big|\,|\pi_V(X)| - \sqrt{d}\,\big| \ge \sqrt{d}/2\big) \le \exp(-\Omega(d/K^2)).$$

This lemma is an extension of the special case considered in [43], in which $\xi$ is Bernoulli. We are going to prove this lemma in Appendix E.
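The concentration asserted in Lemma 4.2 is easy to observe numerically. The following is a minimal sketch, not part of the proof: it assumes the real Bernoulli case (so $K = 1$), a subspace $V$ drawn at random once and then held fixed, and illustrative values of $n$, $d$ and the number of trials. The function name `projection_norms` is hypothetical.

```python
import numpy as np

def projection_norms(n, d, trials, seed=0):
    """Sample |pi_V(X)| for Bernoulli vectors X and one fixed d-dimensional subspace V of R^n."""
    rng = np.random.default_rng(seed)
    # Columns of Q form an orthonormal basis of a d-dimensional subspace V.
    Q, _ = np.linalg.qr(rng.standard_normal((n, d)))
    # Each row of X is a vector with iid +/-1 entries (mean 0, variance 1, |xi| <= 1).
    X = rng.choice([-1.0, 1.0], size=(trials, n))
    # The length of the projection equals the Euclidean norm of the coordinates Q^T X.
    return np.linalg.norm(X @ Q, axis=1)

d = 100
norms = projection_norms(n=400, d=d, trials=2000)
# Fraction of samples violating | |pi_V(X)| - sqrt(d) | < sqrt(d)/2.
frac_bad = np.mean(np.abs(norms - np.sqrt(d)) >= np.sqrt(d) / 2)
print("mean:", norms.mean(), "std:", norms.std(), "bad fraction:", frac_bad)
```

In this sketch the sampled lengths cluster around $\sqrt{d} = 10$, which is the behaviour the lemma quantifies.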

Applying Lemma 4.1 to an $(n-s) \times (n-s)$ block of $B'$ and using the interlacing law (Lemma B.2), we conclude (given that $C_0$ is sufficiently large) that with probability $1 - \exp(-n^{\Omega(1)})$, $B'$ has at least $n^{1-c_0}$ singular values of size $O(n^{1/2-c_0})$, for some absolute constant $c_0 > 0$. This implies the existence of a subspace $V \subset F^{n-s}$ of dimension $d := \lfloor n^{1-c_0} \rfloor$ such that $\|\pi_V B'\|_{op} \ll n^{1/2-c_0}$, where $\pi_V$ is the orthogonal projection to $V$. Let us condition on the event that this is the case (note that this event does not depend on $v$). Then, since $|v'| \le 1$, we have

$$|\pi_V B'v'| \ll n^{1/2-c_0}.$$

Combining this with (25) and the hypothesis $|v_1| \ge n^{-c}$, we conclude that

$$|\pi_V(Y_1)| \ll n^{1/2-c_0+c}.$$

On the other hand, from Lemma 4.2 we see that with probability $1 - \exp(-n^{\Omega(1)})$, we have

$$|\pi_V(Y_1)| \sim \sqrt{d} \sim n^{1/2-c_0/2}.$$

If we choose $c$ sufficiently small depending on $c_0$, we obtain a contradiction with probability $1 - \exp(-n^{\Omega(1)})$. The claim (24) follows.

5. A lower bound for the distance between a random vector and a random hyperplane

In this section we prove Proposition 3.2. This type of result is somewhat related to
the lower tail bounds on the lowest singular value in the literature (e.g. [23], [31], [3d] . 
[37], [33], [33], [3E])- There are two basic means to obtain such bounds, namely via 
Littlewood-Offord theorems and via Berry- Esseen-type central limit theorems. We will 
rely solely on the second route (cf. [25], [36], Section 2.3]). 

Let us begin with a simple lemma. 

Lemma 5.1 (Distance from vector to hyperplane controls inverse). Let $n \ge 1$, let $F$ be the real or complex field, let $A$ be an $n \times n$ $F$-valued invertible matrix with columns $X_1,\dots,X_n$, and let $R_1,\dots,R_n$ denote the rows of $A^{-1}$. Then for any $1 \le i \le n$, we have

$$|R_i| = \mathrm{dist}(X_i, V_i)^{-1},$$

where $V_i$ is the hyperplane spanned by $X_1,\dots,X_{i-1},X_{i+1},\dots,X_n$.

One can directly verify this lemma using the identity $AA^{-1} = I$. It is also the special case $s = 1$ of Lemma 3.3.
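The identity in Lemma 5.1 can be confirmed numerically on a small example. The sketch below is only an illustration under stated assumptions: a real gaussian test matrix (invertible with probability one), and a hypothetical helper `distance_to_hyperplane` that computes the distance by least squares.

```python
import numpy as np

def distance_to_hyperplane(A, i):
    """Distance from column i of A to the span of the remaining columns."""
    others = np.delete(A, i, axis=1)
    # The least-squares residual is the component of X_i orthogonal to the span.
    coeffs, *_ = np.linalg.lstsq(others, A[:, i], rcond=None)
    return np.linalg.norm(A[:, i] - others @ coeffs)

rng = np.random.default_rng(1)
n = 8
A = rng.standard_normal((n, n))

row_norms = np.linalg.norm(np.linalg.inv(A), axis=1)                # |R_i|
dists = np.array([distance_to_hyperplane(A, i) for i in range(n)])  # dist(X_i, V_i)
print(np.allclose(row_norms, 1.0 / dists))
```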

By Lemma 5.1, the desired proposition is equivalent to the lower tail estimate

$$\mathbf{P}\big(\min_{1\le i\le n} d_i \le n^{-100/C_0}\big) \ll n^{-1/C_0}, \qquad (26)$$

where $d_i := \mathrm{dist}(X_i, V_i)$ and $V_i := \mathrm{Span}(X_1,\dots,X_{i-1},X_{i+1},\dots,X_n)$.

To start, we first study the distribution of $d_1$. Let us temporarily condition on (i.e. freeze) the vectors $X_2,\dots,X_n$, leaving $X_1$ random, and let $(v_1,\dots,v_n)$ be a unit normal vector to the hyperplane spanned by $X_2,\dots,X_n$. Applying Proposition 3.5, we see that with probability $1 - \exp(-n^{\Omega(1)})$, we have $\max_{1\le j\le n}|v_j| \le n^{-c}$ for some absolute constant $c > 0$. Suppose that this event occurs. If we write $X_1 = (\xi_1,\dots,\xi_n)$, where $\xi_1,\dots,\xi_n$ are iid copies of $\xi$, then we have

$$d_1 = |\xi_1 v_1 + \dots + \xi_n v_n|.$$


Applying Proposition 3.4$^{3}$, we thus see that for any $t > 0$ and $\varepsilon > 0$, we have

$$\mathbf{P}(d_1 \le t) \le \mathbf{P}(|g_F| \le t + O(\varepsilon)) + O(\varepsilon^{-3}n^{-c});$$

setting $\varepsilon := n^{-c/4}$ we conclude

$$\mathbf{P}(d_1 \le t) \le \mathbf{P}(|g_F| \le t) + O(n^{-c/4}).$$

We can bound $\mathbf{P}(|g_F| \le t)$ by $O(t)$ (when $F = \mathbf{C}$, we have the stronger bound of $O(t^2)$, though we will not exploit this). By symmetry, we thus obtain the lower tail estimate

$$\mathbf{P}(d_i \le t) \ll t + n^{-c/4}. \qquad (27)$$

Notice that a slight modification of the above argument also yields the following 
corollary: 

Corollary 5.2 (Random distance is gaussian). Let $X_1,\dots,X_n$ be random vectors whose entries are iid copies of $\xi$. Then the distribution of the distance $d_1$ from $X_1$ to $\mathrm{Span}(X_2,\dots,X_n)$ is approximately gaussian, in the sense that

$$\mathbf{P}(d_1 \le t) = \mathbf{P}(|g_F| \le t) + O(n^{-c/4}).$$

A naive application of the union bound to (27) is clearly insufficient to establish (26), since we have $n$ distances to deal with. However, we can at least use that bound to show that

$$\mathbf{P}\big(\min_{1\le i\le L} d_i \le t\big) \ll L(t + n^{-c/4})$$

for any $1 \le L \le n$.



The key fact that enables us to overcome the ineffectiveness of the union bound is 
that the distances di are correlated. They tend to be large or small at the same time. 
Quantitatively, we have 

Lemma 5.3 (Correlation between distances). Let $n \ge 1$, let $F$ be the real or complex field, let $A$ be an $n \times n$ $F$-valued invertible matrix with columns $X_1,\dots,X_n$, and let $d_i := \mathrm{dist}(X_i, V_i)$ denote the distance from $X_i$ to the hyperplane $V_i$ spanned by $X_1,\dots,X_{i-1},X_{i+1},\dots,X_n$. Let $1 \le L < j \le n$, let $V_{L,j}$ denote the orthogonal complement of the span of $X_{L+1},\dots,X_{j-1},X_{j+1},\dots,X_n$, and let $\pi_{L,j} : F^n \to V_{L,j}$ denote the orthogonal projection onto $V_{L,j}$. Then

$$d_j \ge \frac{|\pi_{L,j}(X_j)|}{1 + \sum_{i=1}^{L} \frac{|\pi_{L,j}(X_i)|}{d_i}}.$$

We are going to prove this lemma in Appendix C.

$^{3}$When $F = \mathbf{R}$, one could also use the classical Berry-Esseen theorem, Proposition D.1.
Fix $L := \lfloor n^{20/C_0} \rfloor$ and $t := n^{-21/C_0}$; then (if $C_0$ is sufficiently large depending on $c$) we have

$$\mathbf{P}\big(\min_{1\le i\le L} d_i > n^{-21/C_0}\big) \ge 1 - O(n^{-1/C_0}). \qquad (28)$$

To control the other distances $d_j$ for $L < j \le n$, we use Lemma 5.3, which gives the lower bound

$$d_j \ge \frac{|\pi_{L,j}(X_j)|}{1 + \sum_{i=1}^{L} \frac{|\pi_{L,j}(X_i)|}{d_i}}, \qquad (29)$$

where $\pi_{L,j}$ is the orthogonal projection to the $(L+1)$-dimensional space $V_{L,j}$ orthogonal to $X_{L+1},\dots,X_{j-1},X_{j+1},\dots,X_n$.

From Lemma 4.2 and (14), we have

$$\mathbf{P}\big(\big|\,|\pi_{L,j}(X_i)| - \sqrt{L+1}\,\big| \ge \sqrt{L+1}/2\big) \le \exp(-n^{1/C_0})$$

(say) for all $1 \le i \le L < j \le n$. By the union bound, we thus have

$$\mathbf{P}\big(\big|\,|\pi_{L,j}(X_i)| - \sqrt{L+1}\,\big| < \sqrt{L+1}/2 \text{ for all } 1 \le i \le L < j \le n\big) \ge 1 - O(n^{-1/C_0}) \qquad (30)$$

(say). If the events in (28), (30) both occur, then we conclude from (29) that

$$d_j \gg n^{-41/C_0}$$

for all $L < j \le n$, and (26) follows. The proof of Proposition 3.2 is complete.



6. Extensions of Theorem 1.3

In this section, we describe several extensions of Theorem 1.3 mentioned briefly at the end of the introduction. As an application, we show that our results determine the distribution of the condition number of $M_n(\xi)$, extending results of Edelman from [9].



6.1. Joint distribution of least singular values. The proof of Theorem 1.3 can be adapted to control the joint distribution of the bottom $k$ singular values of $M_n(\xi)$, for $k$ a small power of $n$. More precisely, define $\Lambda_{k,n}(\xi) \in (\mathbf{R}^+)^k$ to be the random variable

$$\Lambda_{k,n}(\xi) := \big(n\sigma_n(M_n(\xi))^2, \dots, n\sigma_{n-k+1}(M_n(\xi))^2\big).$$

Then a routine modification of the proof of Theorem 1.3 gives

Theorem 6.2 (Universality for $\Lambda_{k,n}(\xi)$). Let $F$ be the real or complex field, let $\xi$ be $F$-normalized, and suppose $\mathbf{E}|\xi|^{C_0} < \infty$ for some sufficiently large absolute constant $C_0$. If $c > 0$ is a sufficiently small absolute constant, then we have

$$\mathbf{P}(\Lambda_{k,n}(g_F) \in \Omega \setminus \partial_{n^{-c}}\Omega) - n^{-c} \le \mathbf{P}(\Lambda_{k,n}(\xi) \in \Omega) \le \mathbf{P}(\Lambda_{k,n}(g_F) \in \Omega \cup \partial_{n^{-c}}\Omega) + n^{-c}$$

for all $1 \le k \le n^c$ and all measurable $\Omega \subset F^k$, where $\partial_{n^{-c}}\Omega$ is the set of all points in $F^k$ which lie within $n^{-c}$ (in $\ell^\infty$ norm) of the topological boundary of $\Omega$.
Indeed, one simply repeats the arguments for Theorem 1.3 (which is basically the $k = 1$ case) but acquires losses of $k^{O(1)}$ throughout the argument, which will be acceptable if we force $k$ to be a sufficiently small power of $n$, and if one shrinks $c$ as necessary; we omit the details.

Now we focus on the case when $k = O(1)$ is independent of $n$. Then Theorem 6.2 asserts that $\Lambda_{k,n}(\xi)$ converges in the ($k$-dimensional) Lévy metric to a limiting distribution on $F^k$. The precise value of this limiting distribution is complicated to state, but can in principle be computed from the following asymptotics of Forrester [11] in the complex case $\xi = g_{\mathbf{C}}$:

Theorem 6.3 (Bessel kernel asymptotics at the hard edge, complex case). [11] Let $m \ge 1$ be an integer, and let $f \in L^\infty(\mathbf{R}^m)$ be a symmetric function. Then the expression

$$\mathbf{E} \sum_{1 \le i_1 < \dots < i_m \le n} f\big(4n\sigma_{i_1}(M_n(g_{\mathbf{C}}))^2, \dots, 4n\sigma_{i_m}(M_n(g_{\mathbf{C}}))^2\big)$$

converges as $n \to \infty$ to the limit

$$\int_{[0,\infty)^m} f(t_1,\dots,t_m) \det(K(t_i,t_j))_{1\le i,j\le m}\, dt_1 \dots dt_m,$$

where $K$ is the Bessel kernel

$$K(x,y) := \frac{\sqrt{x}\,J_1(\sqrt{x})\,J_0(\sqrt{y}) - J_0(\sqrt{x})\,\sqrt{y}\,J_1(\sqrt{y})}{2(x-y)},$$

where $J_\nu$ is the Bessel function

$$J_\nu(x) = \frac{1}{2\pi}\int_{-\pi}^{\pi} e^{-i(\nu t - x\sin t)}\,dt.$$

In the real case $\xi = g_{\mathbf{R}}$, there is a similar formula, but $K$ is now a $2 \times 2$ matrix-valued kernel, also defined in terms of Bessel functions, which is somewhat more complicated to state; see [28] for details.



As mentioned above, these $k$-point correlation asymptotics are, in principle, sufficient to reconstruct the asymptotic distribution of $\Lambda_{k,n}(\xi)$. For instance, this is done for $k = 1$ in [12] (in particular, recovering the results in Theorem 1.1), and for $k = 2$ and $F = \mathbf{C}$ in [13]. However, the computations rapidly increase in complexity as $k$ increases, and a closed-form expression for the limit in the general $k$ case does not appear to currently be in the literature. On the other hand, one can apply Theorem 6.2 directly to results such as Theorem 6.3 to conclude that the same asymptotics are in fact valid for all $\mathbf{C}$-normalized $\xi$ (and similarly that the asymptotics in [28] are valid for all $\mathbf{R}$-normalized $\xi$); we omit the details.

Figure 3 shows the joint distribution of the smallest two singular values for Bernoulli and gaussian distributions.

Figure 4 plots the functions $\mathbf{P}(\sqrt{n}\,\sigma_{n-k}(M_n(\xi)) \le x)$, where $k = 0, 1, 2$, for Bernoulli and gaussian distributions.
Figure 3. Plotted above is the surface $\mathbf{P}(\sqrt{n}\,\sigma_n(M_n(\xi)) \le x$ and $\sqrt{n}\,\sigma_{n-1}(M_n(\xi)) \le y)$, based on data from 1000 randomly generated matrices with $n = 100$. Lighter shading indicates higher probability (on the vertical axis). The surface on the left was generated with $\xi$ a random Bernoulli variable, taking the values $+1$ and $-1$ each with probability $1/2$; and the surface on the right was generated with $\xi$ a gaussian normal random variable.

[Figure 4: two panels, "Bernoulli" (left) and "Gaussian" (right); horizontal axis from 0 to 6.]

Figure 4. Plotted above are the curves $\mathbf{P}(\sqrt{n}\,\sigma_{n-k}(M_n(\xi)) \le x)$, for $k = 0, 1, 2$, based on data from 1000 randomly generated matrices with $n = 100$. The curves on the left were generated with $\xi$ a random Bernoulli variable, taking the values $+1$ and $-1$ each with probability $1/2$; and the curves on the right were generated with $\xi$ a gaussian normal random variable. In both cases, the curves from left to right correspond to the cases $k = 0, 1, 2$, respectively.
6.4. Rectangular matrices. We can use our method to consider the least singular value of a rectangular matrix of dimension $(n-l) \times n$, where $l$ is a constant. In fact, we can reduce this problem to the square case by adding a (dummy) set of $l$ row vectors which form an orthonormal complement of the $n-l$ row vectors of the matrix. We consider these dummy vectors as the first $l$ vectors of the (now square) matrix and repeat the proof. The additional vectors remain the same after the projection, and by removing them we obtain an $(s-l) \times s$ matrix which is asymptotically gaussian. Thus, we will be able to compare the least singular value of the original matrix with this one.

The limiting distribution of the least singular value of $(n-l) \times n$ random matrices with gaussian entries (for $l$ constant) can be computed, at least in principle. However, we cannot find a reference for this. Thus, we are going to phrase the result here in the spirit of Remark 3.6. To this end, $M_{n-l,n}(\xi)$ denotes the $(n-l) \times n$ random matrix whose entries are iid copies of $\xi$.

Theorem 6.5 (Universality for the least singular value of rectangular matrices). Let $\xi$ be $\mathbf{R}$- or $\mathbf{C}$-normalized, and suppose $\mathbf{E}|\xi|^{C_0} < \infty$ for some sufficiently large absolute constant $C_0$. Let $l$ be a constant. Let $X := n\sigma_{n-l}(M_{n-l,n}(\xi))^2$, $X_{g_{\mathbf{R}}} := n\sigma_{n-l}(M_{n-l,n}(g_{\mathbf{R}}))^2$ and $X_{g_{\mathbf{C}}} := n\sigma_{n-l}(M_{n-l,n}(g_{\mathbf{C}}))^2$. Then there is a constant $c > 0$ such that for any $t > 0$,

$$\mathbf{P}(X \le t - n^{-c}) - n^{-c} \le \mathbf{P}(X_{g_{\mathbf{R}}} \le t) \le \mathbf{P}(X \le t + n^{-c}) + n^{-c}$$

if $\xi$ is $\mathbf{R}$-normalized, and

$$\mathbf{P}(X \le t - n^{-c}) - n^{-c} \le \mathbf{P}(X_{g_{\mathbf{C}}} \le t) \le \mathbf{P}(X \le t + n^{-c}) + n^{-c}$$

if $\xi$ is $\mathbf{C}$-normalized.

Figure 5 plots the functions $\mathbf{P}(\sqrt{n}\,\sigma_{n-k}(M_n(\xi)) \le x)$, where $k = 0, 1, 2$, for Bernoulli and gaussian distributions.

To conclude this section, let us mention two important recent results. In [37], Rudelson and Vershynin obtained strong tail estimates for the smallest singular value of random rectangular matrices of all possible sizes. In [15], Feldheim and Sodin considered the case when $l$ is large, $l = \Theta(n)$, and proved universality for the distribution of the least singular value of random matrices with entries having sub-gaussian tails. (In this case the least singular value is large, of order $\Theta(\sqrt{n})$, with high probability.)

6.6. Random matrices with not necessarily identical entries. All results hold if we drop the condition that the entries of $M_n$ have the same distribution. In the proofs, it is only important that these entries are all normalized, independent, and that their $C_0$-th moment is uniformly bounded. The fact that they are iid copies of $\xi$ has not played any important role. The only place where this information was used is in the paragraph following Lemma 3.3 (this allows one to fix the random rows of $B$ to be the first $s$). But we can easily avoid this by conditioning on the choice of the rows of $B$. Thus we have the following extension of Theorem 1.3.

Theorem 6.7. Let $\xi_{ij}$ be $\mathbf{R}$- or $\mathbf{C}$-normalized, independent random variables such that $\mathbf{E}|\xi_{ij}|^{C_0} \le C_1$ for some sufficiently large absolute constants $C_0$ and $C_1$. Then for all $t > 0$, we have

$$\mathbf{P}(n\sigma_n(M_n(\xi))^2 \le t) = \int_0^t \frac{1+\sqrt{x}}{2\sqrt{x}}\, e^{-(x/2+\sqrt{x})}\,dx + O(n^{-c}) \qquad (31)$$

if $\xi_{ij}$ are all $\mathbf{R}$-normalized, and

$$\mathbf{P}(n\sigma_n(M_n(\xi))^2 \le t) = \int_0^t e^{-x}\,dx + O(n^{-c})$$

if $\xi_{ij}$ are all $\mathbf{C}$-normalized, where $c > 0$ is an absolute constant. The implied constants in the $O(\cdot)$ notation depend on $\mathbf{E}|\xi|^{C_0}$ but are uniform in $t$.

The reader is invited to formalize the extensions of the results in the previous two subsections regarding the joint distribution of the least $k$ singular values and rectangular matrices.

Figure 6 shows an empirical demonstration of this theorem.

Figure 5. Plotted above are the curves $\mathbf{P}(\sqrt{n}\,\sigma_{n-l}(M_{n-l}(\xi)) \le x)$, for $l = 0, 1, 2$, based on data from 1000 randomly generated matrices with $n = 100$. The curves on the left were generated with $\xi$ a random Bernoulli variable, taking the values $+1$ and $-1$ each with probability $1/2$; and the curves on the right were generated with $\xi$ a gaussian normal random variable. In both cases, the curves from left to right correspond to the cases $l = 0, 1, 2$, respectively.
Figure 6. Plotted above is the curve $\mathbf{P}(\sqrt{n}\,\sigma_n(M_n(\xi)) \le x)$, based on data from 1000 randomly generated matrices with $n = 100$. The dotted curve was generated with $\xi$ a random variable taking the value $0$ with probability $1 - (1/2)^c$ and taking the values $+2^{c/2}$ and $-2^{c/2}$ each with probability $(1/2)^{c+1}$, where $c$ is the smallest integer representative of $i+j \bmod 3$. The dashed curve was generated the same way, but with $c = i+j \bmod 4$. The solid curve is a plot of $1 - \exp(-x - x^2/2)$, which is the limiting curve predicted by Theorem 6.7.
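The empirical comparisons reported in the figures can be reproduced in outline as follows. This is a minimal sketch under stated assumptions, not the code used to produce the figures: it uses iid real Bernoulli and gaussian entries, a smaller size ($n = 50$) and fewer trials (400) than the figures, and compares the empirical distribution of $\sqrt{n}\,\sigma_n$ with the curve $1 - \exp(-x - x^2/2)$. All function names are hypothetical.

```python
import numpy as np

def scaled_least_singular_values(n, trials, sampler, rng):
    """Return `trials` samples of sqrt(n) * sigma_n(M) for n x n matrices M drawn by `sampler`."""
    out = np.empty(trials)
    for t in range(trials):
        M = sampler(rng, (n, n))
        # Singular values are returned in descending order; the last one is sigma_n.
        out[t] = np.sqrt(n) * np.linalg.svd(M, compute_uv=False)[-1]
    return out

def limiting_cdf(x):
    """Limiting curve for sqrt(n) * sigma_n in the real case."""
    return 1.0 - np.exp(-x - x**2 / 2.0)

def max_cdf_gap(samples):
    """Largest gap between the empirical CDF at the sample points and the limiting curve."""
    xs = np.sort(samples)
    empirical = np.arange(1, len(xs) + 1) / len(xs)
    return np.max(np.abs(empirical - limiting_cdf(xs)))

def bernoulli(rng, shape):
    return rng.choice([-1.0, 1.0], size=shape)

def gaussian(rng, shape):
    return rng.standard_normal(shape)

rng = np.random.default_rng(0)
gap_bernoulli = max_cdf_gap(scaled_least_singular_values(50, 400, bernoulli, rng))
gap_gaussian = max_cdf_gap(scaled_least_singular_values(50, 400, gaussian, rng))
print("Bernoulli gap:", gap_bernoulli, "gaussian gap:", gap_gaussian)
```

Both gaps should be small and of comparable size, dominated by sampling noise at these parameters.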

6.8. Distribution of the condition number. Let $M$ be an $n \times n$ matrix. Its condition number $\kappa(M)$ is defined as

$$\kappa(M) := \sigma_1(M)/\sigma_n(M).$$

Motivated by an earlier study of von Neumann and Goldstine [30], Edelman [9] studied the distribution of $\kappa(M_n(\xi))$ when $\xi$ is gaussian. His results can be extended to the general setting in this paper, thanks to Theorem 1.3 and the well-known fact that the largest singular value $\sigma_1(M_n(\xi))$ is concentrated strongly around $2\sqrt{n}$.
One can prove this lemma by combining [5, Theorem 5.8] with the tools in Appendix F or by following the arguments in [U [T7j . See also [35], [37] for references.

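Corollary 6.10. Let $\xi_{ij}$ be $\mathbf{R}$- or $\mathbf{C}$-normalized, independent random variables such that $\mathbf{E}|\xi_{ij}|^{C_0} \le C_1$ for some sufficiently large absolute constants $C_0$ and $C_1$. Then for all $t > 0$, we have

$$\mathbf{P}\Big(\frac{1}{2n}\kappa(M_n(\xi)) \ge t\Big) = \int_0^{1/t^2} \frac{1+\sqrt{x}}{2\sqrt{x}}\, e^{-(x/2+\sqrt{x})}\,dx + O(n^{-c}) \qquad (32)$$

if $\xi_{ij}$ are all $\mathbf{R}$-normalized, and

$$\mathbf{P}\Big(\frac{1}{2n}\kappa(M_n(\xi)) \ge t\Big) = \int_0^{1/t^2} e^{-x}\,dx + O(n^{-c})$$

if $\xi_{ij}$ are all $\mathbf{C}$-normalized, where $c > 0$ is an absolute constant. The implied constants in the $O(\cdot)$ notation depend on $\mathbf{E}|\xi|^{C_0}$ but are uniform in $t$.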

Appendix A. Estimating singular values via random sampling 

Lemma A.1 (Random sampling). Let $1 \le s \le n$ be integers. Let $A$ be an $n \times n$ real or complex matrix with rows $R_1,\dots,R_n$. Let $k_1,\dots,k_s \in \{1,\dots,n\}$ be selected independently and uniformly at random, and let $B$ be the $s \times n$ matrix with rows $R_{k_1},\dots,R_{k_s}$. Then

$$\mathbf{E}\Big\|A^*A - \frac{n}{s}B^*B\Big\|_F^2 \le \frac{n}{s}\sum_{k=1}^{n}|R_k|^4.$$

Proof. Let $a_{ij}$ denote the coefficients of $A$, thus $R_i = (a_{i1},\dots,a_{in})$. For $1 \le i, j \le n$, the $ij$ entry of $A^*A - \frac{n}{s}B^*B$ is given by

$$\sum_{k=1}^{n} \overline{a_{ki}}a_{kj} - \frac{n}{s}\sum_{l=1}^{s} \overline{a_{k_l i}}a_{k_l j}. \qquad (33)$$

For $l = 1,\dots,s$, the random variables $\overline{a_{k_l i}}a_{k_l j}$ are iid with mean $\frac{1}{n}\sum_{k=1}^{n}\overline{a_{ki}}a_{kj}$ and variance

$$V_{ij} := \frac{1}{n}\sum_{k=1}^{n}|a_{ki}|^2|a_{kj}|^2 - \Big|\frac{1}{n}\sum_{k=1}^{n}\overline{a_{ki}}a_{kj}\Big|^2, \qquad (34)$$

and so the random variable (33) has mean zero and variance $\frac{n^2}{s}V_{ij}$. Summing over $i, j$, we conclude that

$$\mathbf{E}\Big\|A^*A - \frac{n}{s}B^*B\Big\|_F^2 = \frac{n^2}{s}\sum_{i=1}^{n}\sum_{j=1}^{n}V_{ij}.$$

Discarding the second term in (34), we conclude

$$\mathbf{E}\Big\|A^*A - \frac{n}{s}B^*B\Big\|_F^2 \le \frac{n}{s}\sum_{i=1}^{n}\sum_{j=1}^{n}\sum_{k=1}^{n}|a_{ki}|^2|a_{kj}|^2.$$

Performing the $i, j$ summations, we obtain the claim. □
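The second-moment computation in this proof can be checked numerically. The sketch below is an illustration only, with hypothetical function names and an arbitrary gaussian test matrix: it evaluates the exact expectation $\frac{n^2}{s}\sum_{i,j}V_{ij}$ from (34), the bound $\frac{n}{s}\sum_k|R_k|^4$, and a Monte Carlo estimate obtained by sampling rows with replacement.

```python
import numpy as np

def sampling_error_exact(A, s):
    """Exact E||A*A - (n/s) B*B||_F^2, computed as (n^2/s) * sum_{i,j} V_ij."""
    n = A.shape[0]
    sq = np.abs(A) ** 2
    second_moment = sq.T @ sq / n      # (1/n) sum_k |a_ki|^2 |a_kj|^2
    mean = A.conj().T @ A / n          # (1/n) sum_k conj(a_ki) a_kj
    V = second_moment - np.abs(mean) ** 2
    return n**2 / s * V.sum()

def sampling_error_bound(A, s):
    """The bound (n/s) * sum_k |R_k|^4 from the lemma."""
    n = A.shape[0]
    return n / s * np.sum(np.linalg.norm(A, axis=1) ** 4)

def sampling_error_monte_carlo(A, s, trials, rng):
    """Average of ||A*A - (n/s) B*B||_F^2 over random row samples drawn with replacement."""
    n = A.shape[0]
    gram = A.conj().T @ A
    total = 0.0
    for _ in range(trials):
        rows = rng.integers(0, n, size=s)   # independent, uniform, with replacement
        B = A[rows, :]
        total += np.linalg.norm(gram - (n / s) * (B.conj().T @ B), "fro") ** 2
    return total / trials

rng = np.random.default_rng(5)
A = rng.standard_normal((30, 30))
exact = sampling_error_exact(A, s=5)
bound = sampling_error_bound(A, s=5)
estimate = sampling_error_monte_carlo(A, s=5, trials=4000, rng=rng)
print(exact, bound, estimate)
```

The exact value sits below the bound by precisely the discarded second term of (34), and the Monte Carlo average should agree with the exact value up to sampling error.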
Combining this lemma with the Hoffman-Wielandt theorem (Lemma B.1) we conclude that

$$\mathbf{E}\sum_{i=1}^{n}\Big(\sigma_i(A)^2 - \frac{n}{s}\sigma_i(B)^2\Big)^2 \le \frac{n}{s}\sum_{k=1}^{n}|R_k|^4.$$

Now observe that if $s \le \sqrt{n}$, then by the standard birthday paradox computation$^{4}$, the rows $k_1,\dots,k_s$ are distinct with probability $\gg 1$. Conditioning on this event, we thus see that if $k_1,\dots,k_s$ are sampled from $\{1,\dots,n\}$ without replacement, then

$$\mathbf{E}\sum_{i=1}^{n}\Big(\sigma_i(A)^2 - \frac{n}{s}\sigma_i(B)^2\Big)^2 \ll \frac{n}{s}\sum_{k=1}^{n}|R_k|^4. \qquad (35)$$

The inequality (35) is valid for deterministic matrices $A$. If $A$ is a random matrix (with the $k_1,\dots,k_s$ being sampled independently of $A$), then we may take expectations of (35) for each instance of $A$ and conclude that

$$\mathbf{E}\sum_{i=1}^{n}\Big(\sigma_i(A)^2 - \frac{n}{s}\sigma_i(B)^2\Big)^2 \ll \frac{n}{s}\mathbf{E}\sum_{k=1}^{n}|R_k|^4.$$

Now assume that the random matrix $A$ is row-exchangeable, i.e. the distribution of $A$ is stationary with respect to row permutations $R_i \leftrightarrow R_j$. For fixed and distinct $k_1,\dots,k_s \in \{1,\dots,n\}$, the distribution of $\sum_{i=1}^{n}(\sigma_i(A)^2 - \frac{n}{s}\sigma_i(B)^2)^2$ is independent of the choice of $k_1,\dots,k_s$. Crudely bounding $|R_k|$ by $\max_{1\le k\le n}|R_k|$, we conclude

Corollary A.2 (Sampling of row-exchangeable matrices). Let $s, n$ be integers with $1 \le s \le \sqrt{n}$, let $A$ be a row-exchangeable random $n \times n$ matrix with rows $R_1,\dots,R_n$, and let $B$ be the $s \times n$ matrix with rows $R_1,\dots,R_s$. Then

$$\mathbf{E}\sum_{i=1}^{n}\Big(\sigma_i(A)^2 - \frac{n}{s}\sigma_i(B)^2\Big)^2 \ll \frac{n^2}{s}\,\mathbf{E}\big(\max_{1\le k\le n}|R_k|\big)^4.$$

Remark A.3. One could use row-exchangeability here to instead simplify $\mathbf{E}\sum_{k=1}^{n}|R_k|^4$ as $n\mathbf{E}|R_1|^4$, but it will turn out that this will not be useful for us, as we will need to truncate $\max_{1\le k\le n}|R_k|$ in any event in order to ensure that $\mathbf{E}|R_1|^4$ is bounded.

Remark A.4. In our applications, $A$ is going to be the inverse of $M_n(\xi)$ (after conditioning away some bad but rare events in which $M_n(\xi)$ is close to degenerate), and the summand of most interest on the left-hand side is the $i = 1$ summand. We expect $\sigma_1(A)$ to be of size about $\sqrt{n}$, and we also expect the $|R_k|$ to have size about $O(1)$ (thanks to Lemma 5.1 and basic heuristics about the distance between a random vector and a random hyperplane). So, as soon as $s$ is a non-trivial power of $n$, we expect Corollary A.2 to yield an approximation of the form (11).

$^{4}$One could remove this condition $s \le \sqrt{n}$ by reworking the second moment computation in Lemma A.1 when $k_1,\dots,k_s$ are sampled without replacement (and thus have some slight correlation with each other), but we will not do so here since in our applications $s$ will be a small power of $n$.
In this appendix we collect various basic facts from linear algebra which we will rely upon in this paper. We begin with the Hoffman-Wielandt theorem, which controls the variation of singular values of a matrix using the Frobenius norm.

Lemma B.1 (Hoffman-Wielandt theorem). For any two $n \times n$ self-adjoint matrices $M, M'$, we have

$$\sum_{i=1}^{n}(\sigma_i(M) - \sigma_i(M'))^2 \le \|M - M'\|_F^2.$$

As a corollary, for any two $n \times n$ matrices $A, B$, we have

$$\sum_{i=1}^{n}(\sigma_i(A)^2 - \sigma_i(B)^2)^2 \le \|A^*A - B^*B\|_F^2.$$

Proof. It suffices to prove the first inequality, which we rewrite as

$$\Big(\sum_{i=1}^{n}|\sigma_i(M+N) - \sigma_i(M)|^2\Big)^{1/2} \le \|N\|_F$$

for self-adjoint $M, N$. By the fundamental theorem of calculus (or a compactness argument) and the triangle inequality in $\ell^2$, it suffices to show the infinitesimal version

$$\Big(\sum_{i=1}^{n}|\sigma_i(M+\varepsilon N) - \sigma_i(M)|^2\Big)^{1/2} \le \varepsilon\|N\|_F + o(\varepsilon)$$

of this inequality for fixed $M, N$ and sufficiently small $\varepsilon > 0$. But if we diagonalise $M$, the left-hand side can be computed (up to errors of $o(\varepsilon)$) as $\varepsilon$ times the $\ell^2$ norm of the diagonal of $N$, and the claim follows. □
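The corollary form of Lemma B.1, which is the version used in Appendix A, can be spot-checked numerically. The following sketch is an illustration on one arbitrary pair of real gaussian matrices; the helper name `singular_values` is an assumption made for the example.

```python
import numpy as np

def singular_values(M):
    """Singular values of M in descending order."""
    return np.linalg.svd(M, compute_uv=False)

rng = np.random.default_rng(2)
n = 6
A = rng.standard_normal((n, n))
B = A + 0.3 * rng.standard_normal((n, n))

# Left-hand side: sum_i (sigma_i(A)^2 - sigma_i(B)^2)^2.
lhs = np.sum((singular_values(A) ** 2 - singular_values(B) ** 2) ** 2)
# Right-hand side: ||A*A - B*B||_F^2 (A and B are real, so A* = A^T).
rhs = np.linalg.norm(A.T @ A - B.T @ B, "fro") ** 2
print(lhs, "<=", rhs)
```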

Using the minimax characterization

$$\sigma_j(A) = \sup_{V : \dim(V) = j}\ \inf_{v \in V : |v| = 1} |Av|$$

of the singular values, where $V$ ranges over $j$-dimensional subspaces of $\mathbf{C}^n$, one obtains Weyl's bound

$$|\sigma_j(A) - \sigma_j(B)| \le \|A - B\|_{op} \qquad (36)$$

for any $m \times n$ matrices $A, B$. Another easy corollary of the minimax characterization is

Lemma B.2 (Cauchy interlacing law). Let $A$ be an $m \times n$ matrix, and let $A'$ be an $r \times n$ matrix formed by taking $r$ of the $m$ rows of $A$. Then

$$\sigma_j(A) \ge \sigma_j(A') \ge \sigma_{j+m-r}(A)$$

for all $1 \le j \le \min(r, n)$, with the convention that $\sigma_j(A) = 0$ for $j > \min(m, n)$.
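Both Weyl's bound (36) and the interlacing law are straightforward to test on a random example. The sketch below is an illustration only; the matrix sizes, the perturbation, and the helper `sigma` (which implements the zero-padding convention of Lemma B.2) are assumptions made for the example.

```python
import numpy as np

def sigma(M, j):
    """sigma_j(M) with 1-based j, using the convention sigma_j = 0 for j > min(m, n)."""
    s = np.linalg.svd(M, compute_uv=False)
    return s[j - 1] if j <= len(s) else 0.0

rng = np.random.default_rng(6)
m, n, r = 7, 5, 4
A = rng.standard_normal((m, n))
E = 0.2 * rng.standard_normal((m, n))

# Weyl's bound: |sigma_j(A) - sigma_j(A + E)| <= ||E||_op for every j.
op_norm = np.linalg.norm(E, 2)
weyl_ok = all(abs(sigma(A, j) - sigma(A + E, j)) <= op_norm + 1e-12
              for j in range(1, min(m, n) + 1))

# Interlacing: keep r of the m rows of A.
rows = np.sort(rng.choice(m, size=r, replace=False))
A_sub = A[rows, :]
interlacing_ok = all(
    sigma(A, j) + 1e-12 >= sigma(A_sub, j) >= sigma(A, j + m - r) - 1e-12
    for j in range(1, min(r, n) + 1)
)
print(weyl_ok, interlacing_ok)
```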
Next, we prove the elementary but crucial projection lemma (Lemma 3.3) that relates a random sample of an inverse matrix to a randomly projected version of the original matrix. For the reader's convenience, we restate this lemma.

Lemma B.3 (Projection lemma). Let $1 \le s \le n$ be integers, let $F$ be the real or complex field, let $A$ be an $n \times n$ $F$-valued invertible matrix with columns $X_1,\dots,X_n$, and let $R_1,\dots,R_n$ denote the rows of $A^{-1}$. Let $B$ be the $s \times n$ matrix with rows $R_1,\dots,R_s$. Let $V$ be the $s$-dimensional subspace of $F^n$ formed as the orthogonal complement of the span of $X_{s+1},\dots,X_n$, which we identify with $F^s$ via an orthonormal basis, and let $\pi : F^n \to F^s$ be the orthogonal projection to $V \equiv F^s$. Let $M$ be the $s \times s$ matrix with columns $\pi(X_1),\dots,\pi(X_s)$. Then $M$ is invertible, and we have

$$BB^* = M^{-1}(M^{-1})^*.$$

In particular, we have

$$\sigma_j(B) = \sigma_{s-j+1}(M)^{-1}$$

for all $1 \le j \le s$.

Proof. By construction, we have $R_i \cdot X_j = \delta_{ij}$ for all $1 \le i, j \le n$. In particular, $R_1,\dots,R_s$ lie in $V$, and

$$\pi(R_i) \cdot \pi(X_j) = \delta_{ij}$$

for $1 \le i, j \le s$. Thus $M$ is invertible, and the rows of $M^{-1}$ are given by $\pi(R_1),\dots,\pi(R_s)$. Thus the $ij$ entry of $(M^{-1})(M^{-1})^*$ is given by $\pi(R_i) \cdot \pi(R_j)$, while the $ij$ entry of $BB^*$ is given by $R_i \cdot R_j$. Since $R_1,\dots,R_s$ lie in $V$, and $\pi$ is an isometry on $V$, the claim follows. □
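The projection lemma lends itself to a direct numerical check. The following sketch treats the real case only and is an illustration rather than part of the proof; the test matrix is gaussian, and the orthonormal basis of $V$ is obtained from a complete QR factorization, which is one convenient choice among many.

```python
import numpy as np

rng = np.random.default_rng(3)
n, s = 7, 3
A = rng.standard_normal((n, n))        # columns X_1, ..., X_n
B = np.linalg.inv(A)[:s, :]            # rows R_1, ..., R_s of A^{-1}

# Complete QR of (X_{s+1}, ..., X_n): the first n-s columns of Q span these vectors,
# so the last s columns form an orthonormal basis of the orthogonal complement V.
Q_full, _ = np.linalg.qr(A[:, s:], mode="complete")
Q = Q_full[:, n - s:]

# Coordinates of pi(X_1), ..., pi(X_s) in this basis, as the columns of M.
M = Q.T @ A[:, :s]
M_inv = np.linalg.inv(M)

sv_B = np.linalg.svd(B, compute_uv=False)   # descending
sv_M = np.linalg.svd(M, compute_uv=False)   # descending

print(np.allclose(B @ B.T, M_inv @ M_inv.T))
print(np.allclose(sv_B, 1.0 / sv_M[::-1]))  # sigma_j(B) = 1 / sigma_{s-j+1}(M)
```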



Appendix C. Correlation between distances 



Lemma C.1 (Relationship between different distances). Let $n \ge 1$, let $F$ be the real or complex field, let $A$ be an $n \times n$ $F$-valued invertible matrix with columns $X_1,\dots,X_n$, and let $d_i := \mathrm{dist}(X_i, V_i)$ denote the distance from $X_i$ to the hyperplane $V_i$ spanned by $X_1,\dots,X_{i-1},X_{i+1},\dots,X_n$. Let $1 \le L < j \le n$, let $V_{L,j}$ denote the orthogonal complement of the span of $X_{L+1},\dots,X_{j-1},X_{j+1},\dots,X_n$, and let $\pi_{L,j} : F^n \to V_{L,j}$ denote the orthogonal projection onto $V_{L,j}$. Then

$$d_j \ge \frac{|\pi_{L,j}(X_j)|}{1 + \sum_{i=1}^{L}\frac{|\pi_{L,j}(X_i)|}{d_i}}.$$

Remark C.2. In practice, we will be able to get good upper and lower bounds on $|\pi_{L,j}(X_i)|$. The above lemma asserts that as long as the $d_1,\dots,d_L$ are bounded away from zero, all the other $d_j$ will also be bounded away from zero.

Proof. By relabeling we may take $j = L+1$. For $1 \le i \le L+1$, write $Y_i := \pi_{L,L+1}(X_i) \in V_{L,j}$. By applying $\pi_{L,L+1}$ to both $X_i$ and $V_i$, we see that $d_i$ is the distance from $Y_i$ to the span of $Y_1,\dots,Y_{i-1},Y_{i+1},\dots,Y_{L+1}$ for any $1 \le i \le L+1$. Since $V_{L,j}$ is isomorphic to $F^{L+1}$, we see (on replacing $n$ with $L+1$, and $X_i$ with $Y_i \in V_{L,j} \equiv F^{L+1}$) that we may assume that $L+1 = n$, thus $\pi_{L,j}$ is trivial and our task is now to show that

$$d_n \ge \frac{|X_n|}{1 + \sum_{i=1}^{n-1}\frac{|X_i|}{d_i}}.$$

Let $R_1,\dots,R_n$ be the rows of $A^{-1}$. By Lemma 5.1, $d_i = 1/|R_i|$, so our task is now to show that

$$|R_n| \le \frac{1}{|X_n|} + \sum_{i=1}^{n-1}\frac{|X_i|}{|X_n|}|R_i|.$$

On the other hand, observe (since $R_1,\dots,R_n$ is a dual basis to $X_1,\dots,X_n$) that the vector

$$\frac{\overline{X_n}}{|X_n|^2} - \sum_{i=1}^{n-1}\frac{X_i \cdot \overline{X_n}}{|X_n|^2}R_i$$

is orthogonal to $X_1,\dots,X_{n-1}$ and has an inner product of $1$ with $X_n$, and thus must equal $R_n$. By the triangle inequality and Cauchy-Schwarz we thus obtain the claim. □
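The inequality of Lemma C.1 can be tested numerically for every admissible $j$ at once. The sketch below is an illustration in the real case; the function names, the gaussian test matrix, and the values $n = 8$, $L = 3$ are assumptions made for the example. It uses 0-based column indices, so columns $0,\dots,L-1$ play the role of $X_1,\dots,X_L$, and it assumes $L \le n-2$ so that the list of spanning columns is never empty.

```python
import numpy as np

def distances(A):
    """d_i = dist(X_i, V_i) for the columns X_i of A, computed as 1/|R_i| (Lemma 5.1)."""
    return 1.0 / np.linalg.norm(np.linalg.inv(A), axis=1)

def lower_bound(A, L, j):
    """Right-hand side of the lemma for the 0-based column index j with L <= j < n."""
    n = A.shape[0]
    d = distances(A)
    # Columns X_{L+1}, ..., X_n with X_j removed (0-based: indices L..n-1 except j).
    rest = [k for k in range(L, n) if k != j]
    # The trailing columns of a complete QR factor span the orthogonal complement V_{L,j}.
    Q_full, _ = np.linalg.qr(A[:, rest], mode="complete")
    Q = Q_full[:, len(rest):]                      # n x (L+1)
    proj_norm = np.linalg.norm(Q.T @ A, axis=0)    # |pi_{L,j}(X_i)| for every column i
    return proj_norm[j] / (1.0 + np.sum(proj_norm[:L] / d[:L]))

rng = np.random.default_rng(7)
n, L = 8, 3
A = rng.standard_normal((n, n))
d = distances(A)
bounds = np.array([lower_bound(A, L, j) for j in range(L, n)])
print(np.all(d[L:] >= bounds - 1e-9))
```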



Appendix D. Berry-Esseen type results 



Let us first state the classical Berry-Esseen central limit theorem (see e.g. [H]): 

Proposition D.1 (Berry-Esseen theorem). Let $v_1,\dots,v_n \in \mathbf{R}$ be real numbers with

$$v_1^2 + \dots + v_n^2 = 1, \qquad (37)$$

and let $\xi$ be an $\mathbf{R}$-normalized random variable with finite third moment $\mathbf{E}|\xi|^3 < \infty$. Let $S \in \mathbf{R}$ denote the random variable

$$S = v_1\xi_1 + \dots + v_n\xi_n,$$

where $\xi_1,\dots,\xi_n$ are iid copies of $\xi$. Then for any $t \in \mathbf{R}$ we have

$$\mathbf{P}(S \le t) = \mathbf{P}(g_{\mathbf{R}} \le t) + O\Big(\sum_{j=1}^{n}|v_j|^3\Big),$$

where the implied constant depends on the third moment $\mathbf{E}|\xi|^3$ of $\xi$. In particular, by (37) we have

$$\mathbf{P}(S \le t) = \mathbf{P}(g_{\mathbf{R}} \le t) + O\big(\max_{1\le j\le n}|v_j|\big).$$
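The role of $\max_j |v_j|$ in the error term of Proposition D.1 can be seen in a small simulation. The sketch below is an illustration under stated assumptions: $\xi$ is Bernoulli, the "flat" coefficient vector has all entries equal to $n^{-1/2}$, and the "spiky" vector puts all of its mass on one coordinate, so that $S = \pm 1$ and no gaussian approximation is possible. The function names are hypothetical.

```python
import numpy as np
from math import erf, sqrt

def gaussian_cdf(t):
    """P(g_R <= t) for a standard real gaussian."""
    return 0.5 * (1.0 + erf(t / sqrt(2.0)))

def max_cdf_gap(v, trials, rng):
    """Largest gap, over the sample points, between the empirical CDF of S and the gaussian CDF."""
    xi = rng.choice([-1.0, 1.0], size=(trials, len(v)))
    S = np.sort(xi @ v)
    empirical = np.arange(1, trials + 1) / trials
    gauss = np.array([gaussian_cdf(t) for t in S])
    return np.max(np.abs(empirical - gauss))

rng = np.random.default_rng(4)
n = 400
flat = np.ones(n) / np.sqrt(n)   # max |v_j| = n^{-1/2}
spiky = np.zeros(n)
spiky[0] = 1.0                   # max |v_j| = 1

gap_flat = max_cdf_gap(flat, 10000, rng)
gap_spiky = max_cdf_gap(spiky, 10000, rng)
print("flat:", gap_flat, "spiky:", gap_spiky)
```

In this sketch the flat vector gives a small gap, while the spiky vector gives a gap of roughly $1/2 - \mathbf{P}(g_{\mathbf{R}} \le -1) \approx 0.34$.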



Next, we prove the key Proposition 3.4, which can be seen as a high-dimensional extension of Proposition D.1. Let us restate this proposition for the reader's convenience.

Proposition D.2 (Berry-Esseen-type central limit theorem for frames). Let $1 \le N \le n$, let $F$ be the real or complex field, and let $\xi$ be $F$-normalized and have finite third moment $\mathbf{E}|\xi|^3 < \infty$. Let $v_1,\dots,v_n \in F^N$ be a normalized tight frame for $F^N$, or in other words

$$v_1v_1^* + \dots + v_nv_n^* = I_N, \qquad (38)$$

where $I_N$ is the identity matrix on $F^N$. Let $S \in F^N$ denote the random variable

$$S = \xi_1v_1 + \dots + \xi_nv_n,$$

where $\xi_1,\dots,\xi_n$ are iid copies of $\xi$. Similarly, let $G := (g_{F,1},\dots,g_{F,N}) \in F^N$ be formed from $N$ iid copies of $g_F$. Then for any measurable set $\Omega \subset F^N$ and any $\varepsilon > 0$, one has

$$\mathbf{P}(G \in \Omega \setminus \partial_\varepsilon\Omega) - O\big(N^{5/2}\varepsilon^{-3}(\max_{1\le j\le n}|v_j|)\big) \le \mathbf{P}(S \in \Omega) \le \mathbf{P}(G \in \Omega \cup \partial_\varepsilon\Omega) + O\big(N^{5/2}\varepsilon^{-3}(\max_{1\le j\le n}|v_j|)\big),$$

where

$$\partial_\varepsilon\Omega := \{x \in F^N : \mathrm{dist}_\infty(x, \partial\Omega) \le \varepsilon\},$$

$\partial\Omega$ is the topological boundary of $\Omega$, and $\mathrm{dist}_\infty$ is the distance using the $\ell^\infty$ metric on $F^N$. The implied constant depends on the third moment $\mathbf{E}|\xi|^3$ of $\xi$.



Proof. We shall just prove the upper bound

$$ P(S \in \Omega) \le P(G \in \Omega \cup \partial_\varepsilon \Omega) + O\Big(N^{5/2} \varepsilon^{-3} \max_{1 \le j \le n} |v_j|\Big), $$

as the lower bound then follows by applying the claim to the complement of $\Omega$.

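Let $\psi : F \to \mathbb{R}^+$ be a bump function supported on the unit ball $\{x \in F : |x| \le 1\}$ of
total mass $\int_F \psi = 1$, let $\Psi_{\varepsilon,N} : F^N \to \mathbb{R}^+$ be the approximation to the identity

$$ \Psi_{\varepsilon,N}(x_1, \ldots, x_N) := \prod_{i=1}^N \frac{1}{\varepsilon} \psi\Big(\frac{x_i}{\varepsilon}\Big), $$

and let $f : F^N \to \mathbb{R}^+$ be the convolution

$$ f(x) = \int_{F^N} \Psi_{\varepsilon,N}(y) 1_\Omega(x - y)\, dy \tag{39} $$

where $1_\Omega$ is the indicator function of $\Omega$. Observe that $f$ equals $1$ on $\Omega \setminus \partial_\varepsilon \Omega$, vanishes
outside of $\Omega \cup \partial_\varepsilon \Omega$, and is smoothly varying between $0$ and $1$ on $\partial_\varepsilon \Omega$. Thus it will
suffice to show that

$$ |E(f(S)) - E(f(G))| \ll N^{5/2} \varepsilon^{-3} \max_{1 \le j \le n} |v_j|. $$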
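
To make the smoothing step concrete, here is a one-dimensional sketch of the convolution (39), assuming $F = \mathbb{R}$, $N = 1$ and $\Omega = [0, 1]$. The particular bump $\exp(-1/(1-u^2))$, the grid size and the value $\varepsilon = 0.1$ are illustrative assumptions; the integral is replaced by a discrete sum with weights normalized to total mass one.

```python
import numpy as np

def psi(u):
    """Smooth bump supported on [-1, 1] (normalized numerically below)."""
    out = np.zeros_like(u, dtype=float)
    inside = np.abs(u) < 1
    out[inside] = np.exp(-1.0 / (1.0 - u[inside] ** 2))
    return out

eps = 0.1
y = np.linspace(-eps, eps, 4001)   # support of the rescaled bump
weights = psi(y / eps)
weights /= weights.sum()           # discrete stand-in for total mass 1

def f(x, a=0.0, b=1.0):
    """Mollified indicator of Omega = [a, b]: sum over y of Psi(y) * 1_Omega(x - y)."""
    return float(np.sum(weights * ((x - y >= a) & (x - y <= b))))

for x in (0.5, 0.0, 1.05, 1.5):
    print(x, f(x))
```

Points at distance more than $\varepsilon$ from the boundary give exactly $1$ (inside) or $0$ (outside), and the transition happens only in the $\varepsilon$-neighbourhood of the endpoints.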

We now use a Lindeberg replacement trick (cf. pH E2]). Let $g'_{F,1}, \ldots, g'_{F,n}$ be $n$ iid
copies of $g_F$. From (38) and the fact that the distribution of centered gaussian random
variables is completely determined by their covariance matrix, we see that

$$ g'_{F,1} v_1 + \ldots + g'_{F,n} v_n \equiv G. $$

Thus if we define the random variables

$$ S_j := \xi_1 v_1 + \ldots + \xi_j v_j + g'_{F,j+1} v_{j+1} + \ldots + g'_{F,n} v_n \in F^N $$

we have the telescoping triangle inequality

$$ |E(f(S)) - E(f(G))| \le \sum_{j=1}^n |E f(S_j) - E f(S_{j-1})|. \tag{40} $$

For each $1 \le j \le n$, we may write

$$ S_j = S'_j + \xi_j v_j; \quad S_{j-1} = S'_j + g'_{F,j} v_j $$

where

$$ S'_j := \xi_1 v_1 + \ldots + \xi_{j-1} v_{j-1} + g'_{F,j+1} v_{j+1} + \ldots + g'_{F,n} v_n. $$
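
The bookkeeping of the replacement trick can be sketched for a single sample as follows. This is only an illustration under stated assumptions: real scalars, random signs for $\xi_j$, a frame obtained from a QR factorization, and an arbitrary smooth test function in place of the mollified indicator; the names `hybrid` and `f` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
N, n = 2, 40
v, _ = np.linalg.qr(rng.standard_normal((n, N)))   # rows v_j: a normalized tight frame
xi = rng.choice([-1.0, 1.0], size=n)               # R-normalized signs xi_j
g = rng.standard_normal(n)                         # gaussian replacements g'_{F,j}

def hybrid(j):
    """S_j = xi_1 v_1 + ... + xi_j v_j + g_{j+1} v_{j+1} + ... + g_n v_n."""
    coeffs = np.concatenate([xi[:j], g[j:]])
    return coeffs @ v

f = lambda x: float(np.exp(-np.sum(x ** 2)))       # any smooth test function

# One increment per swapped coordinate; their sum telescopes to f(S_n) - f(S_0).
increments = [f(hybrid(j)) - f(hybrid(j - 1)) for j in range(1, n + 1)]
telescoped = sum(increments)
print(telescoped, f(hybrid(n)) - f(hybrid(0)))
```

Consecutive hybrids differ only in the $j$-th summand, which is what lets each increment be controlled by a Taylor expansion around the common part $S'_j$.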

By Taylor's theorem with remainder we thus have

$$ f(S_j) = f(S'_j) + \xi_j (v_j \cdot \nabla) f(S'_j) + \tfrac{1}{2} \xi_j^2 (v_j \cdot \nabla)^2 f(S'_j) + O\Big(|\xi_j|^3 \sup_{x \in F^N} |(v_j \cdot \nabla)^3 f(x)|\Big) \tag{41} $$

and

$$ f(S_{j-1}) = f(S'_j) + g'_{F,j} (v_j \cdot \nabla) f(S'_j) + \tfrac{1}{2} g'^2_{F,j} (v_j \cdot \nabla)^2 f(S'_j) + O\Big(|g'_{F,j}|^3 \sup_{x \in F^N} |(v_j \cdot \nabla)^3 f(x)|\Big) \tag{42} $$

in the real case $F = \mathbb{R}$, with a similar formula in the complex case $F = \mathbb{C}$ obtainable
by decomposing $\xi_j, g'_{F,j}$ into real and imaginary parts, which we will omit here. A
computation using (39) and the Leibnitz rule reveals that all third partial derivatives
of $f$ have magnitude $O(\varepsilon^{-3})$, and so by Cauchy-Schwarz we have

$$ \sup_{x \in F^N} |(v_j \cdot \nabla)^3 f(x)| \ll |v_j|^3 N^{3/2} \varepsilon^{-3}. $$

Observe that $\xi_j, g'_{F,j}$ are independent of $S'_j$, and have the same mean and variance
(in the complex case $F = \mathbb{C}$, we would use the covariance matrix of the real and
imaginary parts rather than just the variance). Subtracting (41) from (42) and taking
expectations (using the bounded third moment hypothesis on $\xi$) we conclude that

$$ |E(f(S_j)) - E(f(S_{j-1}))| \ll |v_j|^3 N^{3/2} \varepsilon^{-3} $$

and thus by (40)

$$ |E(f(S)) - E(f(G))| \ll N^{3/2} \varepsilon^{-3} \sum_{j=1}^n |v_j|^3. $$

On the other hand, by taking traces of (38) we have

$$ \sum_{j=1}^n |v_j|^2 = N, $$

so that $\sum_{j=1}^n |v_j|^3 \le N \max_{1 \le j \le n} |v_j|$, and the claim follows. □
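
For readers who want a concrete normalized tight frame, the following sketch builds one and checks (38) together with the trace identity used at the end of the proof. It assumes the real case; taking the rows of a matrix with orthonormal columns (here from a QR factorization of a gaussian matrix) is just one convenient construction, not the one used in the paper's applications.

```python
import numpy as np

rng = np.random.default_rng(1)
N, n = 3, 50
# Q is n x N with orthonormal columns, so its rows v_j satisfy sum_j v_j v_j^* = I_N.
Q, _ = np.linalg.qr(rng.standard_normal((n, N)))
v = Q

frame_operator = sum(np.outer(vj, vj) for vj in v)   # left-hand side of (38)
norms = np.linalg.norm(v, axis=1)
sum_sq = np.sum(norms ** 2)                          # trace of (38): equals N
sum_cubes = np.sum(norms ** 3)                       # controlled by N * max_j |v_j|
print(sum_sq, sum_cubes, N * norms.max())
```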

Remark D.3. In our applications, $N$ and $1/\varepsilon$ will be a very small power of $n$. The
theorem then becomes non-trivial as soon as one obtains an upper bound of the form
$n^{-c}$ on the vectors $|v_j|$ for some absolute constant $c > 0$.

Remark D.4. Suppose $\xi$ was now an arbitrary complex random variable of zero mean
and finite variance. If the covariance matrix of $\mathrm{Re}\,\xi$ and $\mathrm{Im}\,\xi$ was a scalar multiple of the
identity, then one is essentially in the $\mathbb{C}$-normalized case and Proposition D.2 applies.
If instead the covariance matrix was degenerate and the $v_1, \ldots, v_n$ had purely real
coefficients, then one is in the $\mathbb{R}$-normalized case and Proposition D.2 again applies.
But the situation is more complicated when the covariance matrix has two distinct
non-zero eigenvalues, and depends on the extent to which the phases of $v_1, \ldots, v_n$ are
aligned. The question of what happens to the least singular value of $M_n(\xi)$ in this case
seems to be an interesting one (presumably, there is no alignment of phases and one
should essentially revert to the $\mathbb{C}$-normalized case), but we will not pursue it here.



Appendix E. Concentration



Let us state a powerful theorem of Talagrand. Let $\Omega_i$ be probability spaces equipped
with measures $\mu_i$, $i = 1, \ldots, n$, and let $\Omega = \Omega_1 \times \cdots \times \Omega_n$ be the product space equipped
with the product measure $\mu = \mu_1 \times \cdots \times \mu_n$. A point $x \in \Omega$ has coordinates $(x_1, \ldots, x_n)$.

For a unit vector $a = (a_1, \ldots, a_n)$ with non-negative coordinates and $x, y \in \Omega$, define

$$ d_a(x, y) := \sum_{i : x_i \ne y_i} a_i. $$

For a subset $A \subset \Omega$ and $x \in \Omega$, let

$$ d_a(x, A) := \inf_{y \in A} d_a(x, y) $$

and

$$ D(x, A) := \sup_{a, |a| = 1} d_a(x, A). $$

In practice, the following (equivalent) definition of $D(x, A)$ is useful. Define

$$ U(x, A) := \{s \in \{0, 1\}^n : \text{there is } y \in A \text{ such that } y_i = x_i \text{ if } s_i = 0\}. $$

It is not too hard (see [23, Chapter 4], [12]) to prove that

$$ D(x, A) = \mathrm{dist}(0, \mathrm{conv}(U(x, A))) = \inf_{s \in \mathrm{conv}(U(x, A))} \|s\|, \tag{43} $$

where dist, as usual, denotes the Euclidean distance.

Theorem E.1 (Talagrand's inequality). [12], [23] For any measurable set $A \subset \Omega$ and
$r \ge 0$,

$$ \mu(\{x : D(x, A) \ge r\})\, \mu(A) \le \exp(-r^2/4). $$

We will use heavily the following corollary of Theorem E.1.

Theorem E.2. Let $D$ be the unit disk $\{z \in \mathbb{C}, |z| \le 1\}$. For every product probability
$\mu$ on $D^n$, every convex 1-Lipschitz function $F : \mathbb{C}^n \to \mathbb{R}$, and every $r \ge 0$,

$$ \mu(|F - M(F)| \ge r) \le 4 \exp(-r^2/16), $$

where $M(F)$ denotes the median of $F$.

This corollary is the complex version of [22, Corollary 4.10] and can be obtained by
slightly modifying its proof. We provide the (simple) details for the sake of completeness.

Proof. (Proof of Theorem E.2) Let $A$ be a subset of $D^n$ and $x$ be a point in $D^n$. By
(43) there is a point $s \in \mathrm{conv}\, U(x, A)$ such that

$$ D(x, A) = \|s\|. $$

By Caratheodory's theorem there are $s^1, \ldots, s^k \in U(x, A)$ with $k \le n + 1$ such that $s$
is a convex combination of the $s^j$. In other words, there are positive numbers $c_1, \ldots, c_k$
such that $\sum_{j=1}^k c_j = 1$ and

$$ s = c_1 s^1 + \cdots + c_k s^k. $$

Let $s_i$ ($s_i^j$) be the $i$th coordinate of $s$ ($s^j$), respectively. Let $y^j$ be the point in $A$ that
corresponds to $s^j$ (with respect to the definition of $U(x, A)$). We have

$$ D(x, A) = \|s\| = \sqrt{\sum_{i=1}^n \Big(\sum_{j=1}^k c_j s_i^j\Big)^2} \ge \sqrt{\sum_{i=1}^n \Big(\sum_{j=1}^k c_j 1_{x_i \ne y_i^j}\Big)^2}. $$

Since $|x_i - y_i^j| \le 2$ (here is the only place where we use the fact that $D$ has bounded
diameter), it follows

$$ 2 D(x, A) = 2\|s\| \ge \sqrt{\sum_{i=1}^n \Big(\sum_{j=1}^k c_j |x_i - y_i^j|\Big)^2}. $$

Notice that the right hand side is at least the Euclidean distance from $x$ to the convex
hull of $A$, so we have

$$ D(x, A) \ge \frac{1}{2} \mathrm{dist}(x, \mathrm{conv} A). \tag{44} $$

Let $F$ be a function as in the statement of the theorem. Let $A := \{x \mid F(x) \le M(F)\}$.
By the definition of median $\mu(A) \ge 1/2$. Furthermore, as $F$ is convex, so is $A$. On the
other hand, if $F(x) \ge M(F) + r$, then by the Lipschitz property $\mathrm{dist}(x, A) \ge r$, which,
via (44), implies that $D(x, A) \ge r/2$. By Theorem E.1 we have

$$ \mu(\{x : F(x) \ge M(F) + r\})\, \mu(A) \le \exp(-r^2/16), $$

which implies

$$ P(F(x) - M(F) \ge r) \le 2 \exp(-r^2/16). $$

For the lower tail, set $A := \{x : F(x) \le M(F) - r\}$ and argue similarly. □
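
A small Monte Carlo experiment illustrates Theorem E.2. The sketch assumes one specific product measure (coordinates uniform on the unit disk) and one specific convex 1-Lipschitz function, $F(x) = |x_1 + \cdots + x_n| / \sqrt{n}$; both are illustrative choices, and the sample sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
n, trials = 200, 5000
# iid coordinates, each uniform in the unit disk D (a product measure on D^n)
radius = np.sqrt(rng.uniform(size=(trials, n)))
angle = rng.uniform(0, 2 * np.pi, size=(trials, n))
x = radius * np.exp(1j * angle)

# F(x) = |x_1 + ... + x_n| / sqrt(n): a seminorm, hence convex, and 1-Lipschitz
# because |sum_i (x_i - y_i)| <= sqrt(n) * |x - y| by Cauchy-Schwarz.
F = np.abs(x.sum(axis=1)) / np.sqrt(n)
median = np.median(F)

rows = []
for r in (0.5, 1.0, 2.0, 5.0):
    empirical = float(np.mean(np.abs(F - median) >= r))
    bound = 4 * np.exp(-r ** 2 / 16)
    rows.append((r, empirical, bound))
    print(r, empirical, bound)
```

The bound $4\exp(-r^2/16)$ only drops below $1$ once $r$ exceeds $\sqrt{16 \ln 4}$, so for a function of this size it is far from sharp; its value lies in being uniform over all convex 1-Lipschitz $F$ and all product measures on $D^n$.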

An easy change of variables reveals the following generalization of this inequality: if
$\mu$ is supported on a dilate $K \cdot D^n$ of the unit disk for some $K > 0$, rather than $D^n$
itself, then for every $r > 0$ we have

$$ \mu(|F - M(F)| \ge r) \le 4 \exp(-r^2/16K^2). \tag{45} $$

Theorem E.2 shows concentration around the median. In applications, it is usually
more useful to have concentration around the mean. This can be done via the following
lemma, which shows that concentration around the median implies that the mean and
the median are close.

Lemma E.3. Let $X$ be a random variable such that for any $r \ge 0$

$$ P(|X - M(X)| \ge r) \le 4 \exp(-r^2). $$

Then

$$ |E(X) - M(X)| \le 100. $$

The bound 100 is ad hoc and can be replaced by a much smaller constant.

Proof. Set $M := M(X)$ and let $F(x)$ be the distribution function of $X$. We have

$$ E(X) = \sum_i \int_{M-i}^{M-i+1} x\, dF(x) \le M + 4 \sum_i |i| e^{-i^2} \le M + 100. $$

The lower bound can be proved similarly. □
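
To see how much room there is in the constant 100, one can bound $|E(X) - M(X)| \le E|X - M(X)| = \int_0^\infty P(|X - M(X)| \ge r)\, dr \le \int_0^\infty \min(1, 4e^{-r^2})\, dr$ and evaluate the last integral numerically. This route via the layer-cake formula is a standard alternative to the dyadic sum in the proof, not the paper's own computation; the helper below uses a plain midpoint rule with an arbitrary cutoff.

```python
import math

def tail_integral(upper=10.0, steps=200000):
    """Midpoint rule for the integral of min(1, 4 exp(-r^2)) over [0, upper]."""
    h = upper / steps
    total = 0.0
    for k in range(steps):
        r = (k + 0.5) * h
        total += min(1.0, 4.0 * math.exp(-r * r))
    return total * h

bound = tail_integral()
print(bound)   # roughly 1.5, far below the ad hoc constant 100
```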

Now we use Theorem E.2 to prove Lemma 4.2. Let us first restate this lemma.

Lemma E.4 (Concentration estimate). Let $n \ge d \ge 1$ be integers, let $K \ge 1$, let $F$ be
the real or complex field, let $\xi$ be $F$-normalized with $|\xi| \le K$ almost surely, and let $V$ be
a subspace of $F^n$ of dimension $d$. Let $X := (\xi_1, \ldots, \xi_n)$, where $\xi_1, \ldots, \xi_n$ are iid copies
of $\xi$. Suppose that $d \ge C K^2$ for some sufficiently large absolute constant $C$. Then

$$ P(|\mathrm{dist}(X, V) - \sqrt{d}| \ge \sqrt{d}/2) \le \exp(-\Omega(d/K^2)). $$

Proof. The map $X \mapsto \mathrm{dist}(X, V)$ is clearly convex and 1-Lipschitz. Applying (45) we
conclude that

$$ P(|\mathrm{dist}(X, V) - M(\mathrm{dist}(X, V))| \ge t) \le \exp(-\Omega(t^2/K^2)) \tag{46} $$

for any $t > 0$. On the other hand, if $\pi : F^n \to V$ is the orthogonal projection to $V$,
then (by the $F$-normalization of $\xi$) we have

$$ E\, \mathrm{dist}(X, V)^2 = E X^* \pi X = \mathrm{trace}(\pi^* \pi) = \dim(V) = d. $$

By Chebyshev's inequality, we conclude that the median $M(\mathrm{dist}(X, V))$ of $\mathrm{dist}(X, V)$
is at most $2\sqrt{d}$. A simple calculation reveals that the fact

$$ |\mathrm{dist}(X, V)^2 - M(\mathrm{dist}(X, V))^2| \ge t $$

implies

$$ |\mathrm{dist}(X, V) - M(\mathrm{dist}(X, V))| \ge \min(t/\sqrt{d}, \sqrt{t}). $$

(This is easiest to see by working in the contrapositive.) Thus,

$$ P(|\mathrm{dist}(X, V)^2 - M(\mathrm{dist}(X, V))^2| \ge t) \le \exp(-\Omega(t/K^2)) + \exp(-\Omega(t^2/dK^2)). $$

Integrating this, we see that

$$ d = E\, \mathrm{dist}(X, V)^2 = M(\mathrm{dist}(X, V))^2 + O(K^2) + O(\sqrt{d} K); $$

if $d \ge C K^2$ for a sufficiently large $C$, we conclude that

$$ 0.9 \sqrt{d} \le M(\mathrm{dist}(X, V)) \le 1.1 \sqrt{d} $$

and the claim follows from (46). □
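
The concentration in Lemma E.4 is easy to observe in simulation. The sketch below computes the quantity whose second moment is evaluated in the proof, namely $X^* \pi X = |\pi X|^2$ for an orthogonal projection $\pi$ of rank $d$ (equivalently, the squared distance from $X$ to the orthogonal complement of the range of $\pi$). The random sign entries ($K = 1$), the random choice of subspace and the sample sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, trials = 400, 100, 2000
# Orthonormal basis (columns) of a random d-dimensional subspace: the range of pi.
basis, _ = np.linalg.qr(rng.standard_normal((n, d)))

X = rng.choice([-1.0, 1.0], size=(trials, n))   # R-normalized entries, |xi| <= K = 1
proj_len = np.linalg.norm(X @ basis, axis=1)    # |pi X| for each sample

mean_sq = float(np.mean(proj_len ** 2))         # E X^* pi X = trace(pi) = d
worst_dev = float(np.max(np.abs(proj_len - np.sqrt(d))))
print(mean_sq, worst_dev, np.sqrt(d) / 2)
```

In this run every sample stays well within $\sqrt{d}/2$ of $\sqrt{d}$, consistent with the exponentially small failure probability in the lemma.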



Remark E.5. One can obtain a more precise result by also computing the variance of
$\mathrm{dist}(X, V)^2$; see jl3l Lemma 2.2] for this computation in the model case of the Bernoulli
random variable.

Appendix F. Random matrices have many small singular values

In this section, we prove Lemma 4.1 which we now restate.

Lemma F.1 (Many small singular values). Let $n \ge 1$, and let $\xi$ be $\mathbb{R}$-normalized or
$\mathbb{C}$-normalized, such that $|\xi| \le n^{\varepsilon}$ almost surely for some sufficiently small absolute
constant $\varepsilon > 0$. Then there are positive constants $c, c'$ such that with probability
$1 - \exp(-n^{\Omega(1)})$, $M_n(\xi)$ has at least $c' n^{1-c}$ singular values in the interval $[0, n^{1/2-c}]$.

The first ingredient of the proof is the following result on the rate of convergence to
the Marchenko-Pastur law.

Theorem F.2 (Rate of convergence to Marchenko-Pastur law). [S] There are positive
constants $c_1, c_2$ such that the following holds. Let $\xi$ be an $\mathbb{R}$- or $\mathbb{C}$-normalized random
variable with bounded $c_1$-moment. Then for any $t > 0$,

$$ P\left( \sup \left| \frac{1}{n} \Big|\Big\{1 \le i \le n : \frac{1}{n} \sigma_i(M_n(\xi))^2 \le t\Big\}\Big| - \frac{1}{2\pi} \int_0^{\min(t,4)} \sqrt{\frac{4}{x} - 1}\, dx \right| \ge n^{-c_2} \right) = o(1). $$

In other words, the ESD of the Wishart matrix $\frac{1}{n} M_n^* M_n$ converges to the Marchenko-Pastur
law with rate $n^{-c_2}$ almost surely.

The values for $c_1, c_2$ are quite reasonable. In p2, Theorem 8.29], it is shown that one
can set $c_1 = 6$ and $c_2 = 1/6 - \varepsilon$, for any fixed $\varepsilon > 0$. A more recent result [HJ Theorem
1.2] claimed that one can set $c_1 = 4$ and $c_2 = 1/2 - \varepsilon$.

This theorem, however, does not imply the desired claim, as it only shows that $M_n$
has many small singular values with probability $1 - o(1)$, while we need the failure
probability to be exponentially small.
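
The convergence described by Theorem F.2 can be observed already at moderate size. The sketch below compares the empirical distribution of $\frac{1}{n} \sigma_i(M_n)^2$ for a random sign matrix with the Marchenko-Pastur distribution function $\frac{1}{2\pi} \int_0^{\min(t,4)} \sqrt{4/x - 1}\, dx$, evaluated in closed form after the substitution $x = u^2$. The matrix size, the sign distribution and the function name `mp_cdf` are illustrative assumptions.

```python
import numpy as np

def mp_cdf(t):
    """(1/2pi) * integral of sqrt(4/x - 1) over [0, min(t, 4)].

    With x = u^2 the integrand becomes (1/pi) * sqrt(4 - u^2), whose
    antiderivative is (1/pi) * ((u/2) * sqrt(4 - u^2) + 2 * arcsin(u/2)).
    """
    a = np.sqrt(min(max(t, 0.0), 4.0))
    return (0.5 * a * np.sqrt(4.0 - a * a) + 2.0 * np.arcsin(a / 2.0)) / np.pi

rng = np.random.default_rng(5)
n = 400
M = rng.choice([-1.0, 1.0], size=(n, n))
sv_sq = np.linalg.svd(M, compute_uv=False) ** 2 / n   # (1/n) * sigma_i(M)^2

for t in (0.1, 1.0, 2.0, 4.0):
    print(t, float(np.mean(sv_sq <= t)), mp_cdf(t))
```

Near $t = 0$ the limiting distribution function behaves like $\frac{2}{\pi}\sqrt{t}$, which is the source of the polynomially many small singular values counted in the rest of this appendix.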

For a constant $1/2 > c > 0$, let $Y_c$ be the number of singular values in the interval
$[0, n^{1/2-c}]$. From Theorem F.2 we can at least conclude that

$$ E(Y_c) = \Omega(n^{1-c}). \tag{47} $$

We are going to show that for some sufficiently small positive constant $\alpha$

$$ P(Y_c \le \alpha n^{1-c}) \le \exp(-n^{\Omega(1)}). \tag{48} $$

One can achieve this goal by following, with a few minor modifications, the powerful
approach introduced by Guionnet and Zeitouni in [17]. We present the details for the
sake of completeness and the readers' convenience.

Consider a random hermitian matrix $W_N$ with independent entries $w_{ij}$, $1 \le i \le j \le N$,
with support in a compact region $S$ with diameter $K$. Let $f$ be a real, convex,
1-Lipschitz function and define

$$ Z := \sum_{i=1}^N f(\lambda_i) $$

where $\lambda_i$ are the eigenvalues of $\frac{1}{\sqrt{N}} W_N$. We are going to view $Z$ as the function of
the variables $w_{ij}$, $1 \le i \le j \le N$. The main difference between the current setting
and that of [17] is that here we do not require the real and imaginary parts of $w_{ij}$ be
independent.

We use the following two lemmas from [17].

Lemma F.3. $Z$ is a convex function.

This lemma is a consequence of [17, Lemma 1.2(a)].

Lemma F.4. $Z$ is $\sqrt{2}$-Lipschitz.

This lemma can be derived from [17, Lemma 1.2(b)]. We give here a short, different,
proof.

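Proof. (Proof of Lemma F.4) Consider two matrices $W$ and $W'$ with entries $w_{ij}$ and
$w'_{ij}$, and eigenvalues $\lambda_i$ and $\lambda'_i$ (in decreasing order), respectively. By Lemma IB.H

$$ \sum_{i=1}^N |\lambda_i - \lambda'_i|^2 \le \frac{1}{N} \|W - W'\|_F^2 \le \frac{2}{N} \sum_{1 \le i \le j \le N} |w_{ij} - w'_{ij}|^2. $$

On the other hand, by Cauchy-Schwarz and the fact that $f$ is 1-Lipschitz,

$$ |Z - Z'|^2 \le N \sum_{i=1}^N |f(\lambda_i) - f(\lambda'_i)|^2 \le N \sum_{i=1}^N |\lambda_i - \lambda'_i|^2. $$

It follows that

$$ |Z - Z'| \le \sqrt{2} \Big( \sum_{1 \le i \le j \le N} |w_{ij} - w'_{ij}|^2 \Big)^{1/2}, $$

completing the proof. □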
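
Both inequalities in this proof can be tested on random inputs. The following sketch assumes $f(x) = |x|$ as the convex 1-Lipschitz function and draws two independent hermitian matrices with bounded entries; the helper names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(6)
N = 30

def random_hermitian(rng, N):
    """Hermitian matrix with bounded entries (illustrative distribution)."""
    A = rng.uniform(-1, 1, (N, N)) + 1j * rng.uniform(-1, 1, (N, N))
    return (A + A.conj().T) / 2

def Z(W, f=np.abs):
    """Z = sum_i f(lambda_i), lambda_i the eigenvalues of W / sqrt(N)."""
    return float(np.sum(f(np.linalg.eigvalsh(W / np.sqrt(len(W))))))

W, W2 = random_hermitian(rng, N), random_hermitian(rng, N)
upper = np.triu_indices(N)   # index pairs with i <= j
entry_dist = float(np.sqrt(np.sum(np.abs(W[upper] - W2[upper]) ** 2)))

print(abs(Z(W) - Z(W2)), np.sqrt(2) * entry_dist)
```

The eigenvalue comparison step pairs the sorted spectra of the two matrices, which is why `eigvalsh` (sorted output) can be used directly.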
By Theorem E.2 we obtain the following theorem, which is an extension of [17,
Theorem 1.3] to the case where the real and imaginary parts of $w_{ij}$ are not necessarily
independent.

Theorem F.5. Let $W_N, f, Z$ be as above. Then there is a constant $c > 0$ such that for
any $T > 0$

$$ P(|Z - M(Z)| \ge T) \le 4 \exp(-cT^2/K^2), $$

where $K$ is the bound on the absolute values of the entries of $W_N$.

It will be better for us to replace $M(Z)$ by $E(Z)$. By Lemma E.3 and rescaling,

$$ |M(Z) - E(Z)| = O(K^2). $$

Thus, if $K^2 = o(T)$, then (by adjusting $c$ if necessary) we have

$$ P(|Z - E(Z)| \ge T) \le 4 \exp(-cT^2/K^2). \tag{49} $$

Recall that we are considering the singular values of a (random) non-hermitian matrix
$M_n$ instead of the eigenvalues of a hermitian matrix $W_N$. This problem can be easily
dealt with by the standard trick of defining $N := 2n$ and

$$ W_N := \begin{pmatrix} 0 & M_n \\ M_n^* & 0 \end{pmatrix}. $$

It is well known that if the singular values of $M$ are $\sigma_1, \ldots, \sigma_n$, then the eigenvalues
of $W$ are $\pm\sigma_1, \ldots, \pm\sigma_n$. The number of singular values of $\frac{1}{\sqrt{n}} M$ in $[0, n^{-c}]$ is thus half
of the number of eigenvalues of $\frac{1}{\sqrt{n}} W$ in $I := [-n^{-c}, n^{-c}]$. In order to estimate the last
quantity, it is natural to define

$$ Z := \sum_{i=1}^N \chi_I(\lambda_i) $$

where $\chi_I$ is the indicator function of $I$ and $\lambda_i$ are the eigenvalues of $W_N$. This function
is, however, not convex and Lipschitz. On the other hand, we can easily overcome this
problem by constructing two real functions $f_1, f_2$ such that

• $f_j$ are symmetric, convex and $\frac{C}{|I|}$-Lipschitz, for some sufficiently large constant
$C$.

• $f_1(x) = f_2(x)$ for any $x \notin I$.

• $f_2(x) + 1 \ge f_1(x) \ge f_2(x)$ for any $x \in I$.

• $f_1(x) - f_2(x) \ge 1/2$ for any $x \in \frac{1}{2} I$.
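
One explicit pair with these four properties is sketched below; it is offered only as a possible choice, since the text does not specify the functions. Writing $I = [-\delta, \delta]$, take $f_1(x) = \max(|x|, \delta)/\delta$ and $f_2(x) = |x|/\delta$, so that $f_1 - f_2 = \max(0, 1 - |x|/\delta)$ and the Lipschitz constant is $1/\delta = 2/|I|$. The value of `delta` and the grid are arbitrary.

```python
import numpy as np

delta = 0.05   # I = [-delta, delta], so |I| = 2 * delta

def f1(x):
    return np.maximum(np.abs(x), delta) / delta

def f2(x):
    return np.abs(x) / delta

x = np.linspace(-1.0, 1.0, 20001)
gap = f1(x) - f2(x)               # equals max(0, 1 - |x| / delta)
inside = np.abs(x) <= delta
half = np.abs(x) <= delta / 2

# Discrete checks of the Lipschitz constant and of convexity on the grid.
slopes = np.abs(np.diff(f1(x))) / np.diff(x), np.abs(np.diff(f2(x))) / np.diff(x)
second_diff_f1 = f1(x)[:-2] - 2 * f1(x)[1:-1] + f1(x)[2:]
second_diff_f2 = f2(x)[:-2] - 2 * f2(x)[1:-1] + f2(x)[2:]
print(gap.max(), gap[half].min(), max(s.max() for s in slopes))
```

Both functions are even, each is a maximum of affine functions (hence convex), and the difference is a tent function supported on $I$ that stays at least $1/2$ on the middle half of $I$.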

Define $Z_1, Z_2$ with respect to $f_1, f_2$. Applying (49) to $Z_j$ with $T := \alpha n^{1-c}$, for some
small positive constant $\alpha$ (to be chosen), we have

$$ P(|Z_j - E(Z_j)| \ge T) \le 4 \exp\Big(-\frac{c\, T^2 |I|^2}{C^2 K^2}\Big) = \exp(-n^{\Omega(1)}), \tag{50} $$

given the fact that $|I| = O(n^{-c})$ where $c$ is sufficiently small and that $K$ (the bound
on the absolute values of the entries) is a sufficiently small power of $n$.

By (47) and choosing $\alpha$ sufficiently small, we can assume $E(Z_1) - E(Z_2) \ge 3\alpha n^{1-c}$.
By the triangle inequality and (50), it follows that

$$ P(Z_1 - Z_2 \le \alpha n^{1-c}) \le \exp(-n^{\Omega(1)}). $$
The desired bound (48) follows from this and the fact that

$$ \sum_{i=1}^N \chi_I(\lambda_i) \ge \sum_{i=1}^N \big(f_1(\lambda_i) - f_2(\lambda_i)\big) = Z_1 - Z_2. $$
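
The hermitization step and the final counting inequality can be checked together on a sample matrix. This sketch assumes a random sign matrix, the normalization by $\sqrt{n}$, an interval $I = [-\delta, \delta]$ with $\delta = n^{-1/4}$, and the illustrative pair $f_1(x) = \max(|x|, \delta)/\delta$, $f_2(x) = |x|/\delta$; none of these specific choices are taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100
M = rng.choice([-1.0, 1.0], size=(n, n))
# Hermitization: W has size N = 2n and eigenvalues +sigma_i and -sigma_i.
W = np.block([[np.zeros((n, n)), M], [M.conj().T, np.zeros((n, n))]])

eig = np.linalg.eigvalsh(W) / np.sqrt(n)
sv = np.linalg.svd(M, compute_uv=False) / np.sqrt(n)

delta = n ** -0.25                                   # I = [-delta, delta]
count_in_I = int(np.sum(np.abs(eig) <= delta))       # sum_i chi_I(lambda_i)
Z1 = float(np.sum(np.maximum(np.abs(eig), delta) / delta))
Z2 = float(np.sum(np.abs(eig) / delta))
print(count_in_I, 2 * int(np.sum(sv <= delta)), Z1 - Z2)
```

The eigenvalue count in $I$ is twice the number of small singular values, and $Z_1 - Z_2$ sits below that count, so a lower bound on $Z_1 - Z_2$ transfers to the number of small singular values.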